>the GPT-OSS models are also quite good
I recently pitted gpt-oss 120b against Qwen3-Next 80b on a lot of internal benchmarks (for production use). For me, gpt-oss was slightly slower (vLLM, both fit in VRAM), much worse at multilingual tasks (evaluated across 33 languages), and worse at instruction following: Qwen3-Next could reuse the same prompts I'd written for Gemma3 perfectly, while gpt-oss struggled and our RAG benchmark scores dropped from 90% to 60% without additional prompt engineering.
And that's with Qwen3-Next running as a random unofficial 4-bit quant (whereas gpt-oss has native support), plus I had to disable multi-token prediction in Qwen3-Next because vLLM crashed with it enabled.
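If anyone wants to reproduce this kind of head-to-head, here's a minimal sketch of the harness shape, not my actual benchmark: the eval cases below are toy placeholders, and the model names/ports assume you served both models yourself behind vLLM's OpenAI-compatible server.

    # Assumes both models are already up, e.g.:
    #   vllm serve openai/gpt-oss-120b --port 8000
    #   vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --port 8001
    # For Qwen3-Next I simply didn't pass any speculative/MTP config,
    # since enabling MTP crashed my vLLM build.
    import time
    from openai import OpenAI

    endpoints = {
        "openai/gpt-oss-120b": "http://localhost:8000/v1",
        "Qwen/Qwen3-Next-80B-A3B-Instruct": "http://localhost:8001/v1",
    }

    # Toy stand-in for the real eval set: (prompt, substring the answer
    # must contain). The real set covered 33 languages.
    cases = [
        ("Answer in one word: what is the capital of France?", "paris"),
        ("Antworte in einem Wort: Was ist die Hauptstadt Italiens?", "rom"),
    ]

    for model, base_url in endpoints.items():
        client = OpenAI(base_url=base_url, api_key="EMPTY")
        hits, start = 0, time.perf_counter()
        for prompt, expected in cases:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.0,
            )
            hits += expected in resp.choices[0].message.content.lower()
        elapsed = time.perf_counter() - start
        print(f"{model}: {hits}/{len(cases)} correct, {elapsed:.1f}s total")

Same prompt set, same sampling settings, one server per model; that's enough to see both the latency gap and the accuracy gap on identical inputs.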
Has anyone here tried both gpt-oss 120b and Qwen3-Next 80b? Maybe I was doing something wrong, because I've seen a lot of people praise gpt-oss.
gpt-oss is STEM-maxxed, so I imagine most of the praise comes from people using it for agentic coding.
> We trained the models on a mostly English, text-only dataset, with a focus on STEM, coding, and general knowledge.
https://openai.com/index/introducing-gpt-oss/