Have you tried GPT-OSS-120b MXFP4 with reasoning effort set to high? Out of all models I can run within 96GB, it seems to consistently give better results. What exact llama model (+ quant I suppose) is it that you've had better results against, and what did you compare it against, the 120b or 20b variant?
How are you running this? I've had issues with Opencode formulating bad messages when the model runs on llama.cpp. Jinja threw a bunch of errors and GPT-OSS couldn't make tool calls. There's an issue for this on Opencode's repo but seems like it's been waiting or a couple of weeks.
> What exact llama model (+ quant I suppose) is it that you've had better results against
Not llama, but Qwen3-coder-next is on top of my list right now. Q8_K_XL. It's incredible (not just for coding).