The thing is, GLM 4.7 is easily doing the work Opus was doing for me, but to run the full model you'll need much bigger hardware than a Mac Studio. $10k buys you a lot of API calls from z.ai or Anthropic. It's just not economically viable to run a good model at home.
True. I think local inference is still far more expensive for my use case: providers amortize their hardware by batching many users' requests onto the same GPUs, while my usage is sporadic, a few bursts an hour, so a home box would sit mostly idle. That said, I also didn't expect hardware prices (RTX 5090, RAM) to rise this quickly.
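To make that concrete, here's a rough back-of-the-envelope comparison. Every number below (hardware cost, token price, usage volume) is an assumption I've plugged in for illustration, not a quote from any provider:

```python
# Rough cost comparison: local rig vs. pay-per-token API.
# All figures are assumed for illustration only.

HARDWARE_COST = 10_000        # one-time, e.g. a maxed-out Mac Studio (assumed)
HARDWARE_LIFETIME_YEARS = 3   # assumed depreciation window
POWER_COST_PER_YEAR = 200     # assumed electricity for sporadic use

API_PRICE_PER_MTOK = 3.00     # assumed blended $/1M tokens (input + output)
TOKENS_PER_DAY = 500_000      # assumed heavy-ish personal coding usage

local_per_year = HARDWARE_COST / HARDWARE_LIFETIME_YEARS + POWER_COST_PER_YEAR
api_per_year = TOKENS_PER_DAY * 365 / 1e6 * API_PRICE_PER_MTOK

print(f"local: ~${local_per_year:,.0f}/yr, API: ~${api_per_year:,.0f}/yr")
# With these numbers: local ~ $3,533/yr vs. API ~ $548/yr.
# The gap is utilization: the provider batches thousands of users
# onto the same GPUs, while the home box idles between requests.
```

Tweak the assumptions and the crossover point moves, but at sporadic personal volumes the batched API side wins by a wide margin.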
You can cluster Mac Studios over Thunderbolt and enable RDMA for distributed inference. It will be slower than a single node, but it's still the best bang for the buck for running inference on very large models.
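For anyone curious what that looks like in practice, here's a minimal sketch using MLX's distributed API (mx.distributed), which is the usual way to drive a Thunderbolt-linked Mac cluster. The hostnames are placeholders, and this only verifies the ring with an all-reduce; actual model-parallel inference on top of it is more involved:

```python
# ring_check.py -- minimal MLX distributed sanity check.
# Launch across Thunderbolt-connected Macs with something like:
#   mlx.launch --hosts studio1,studio2 ring_check.py
# (hostnames are placeholders; see the MLX distributed docs for setup)
import mlx.core as mx

# init() picks up the launcher's environment; the "ring" backend
# runs over plain TCP, so it works across Thunderbolt bridge links.
group = mx.distributed.init()

# Each node contributes a vector; all_sum all-reduces it across the ring.
x = mx.ones(4) * (group.rank() + 1)
total = mx.distributed.all_sum(x, group=group)
mx.eval(total)

print(f"rank {group.rank()}/{group.size()}: all_sum -> {total}")
# On 2 nodes this prints [3, 3, 3, 3] on both ranks (1 + 2).
```

Once the collective ops work over the Thunderbolt bridge, sharding a model across nodes is mostly a matter of where you place the layers; the interconnect is the bottleneck, which is why it's slower than one big node.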