Hacker News

GPT-OSS vs. Qwen3, and a detailed look at how things have evolved since GPT-2

455 points | by ModelForge | last Sunday at 3:06 PM | 96 comments

Comments

starchild3001 | last Monday at 4:11 AM

What stood out to me is how much of gpt-oss’s “newness” isn’t about radical architectural departures, but about a careful layering of well-understood optimizations—RoPE, SwiGLU, GQA, MoE—with some slightly unusual choices (tiny sliding-window sizes, few large experts instead of many small ones, per-head attention sinks).
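
Not from the thread, but for anyone wondering what "per-head attention sinks" means mechanically, here is a minimal sketch (my own reconstruction, not OpenAI's code): each head gets one learned logit that joins the softmax normalization but is attached to no value, so real tokens are not forced to absorb all the probability mass.

    import torch
    import torch.nn.functional as F

    def attention_with_sinks(q, k, v, sink_logits):
        # q, k, v: (batch, heads, seq, head_dim); sink_logits: (heads,)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        # Append one learned "virtual key" logit per head.
        b, h, t, _ = scores.shape
        sink = sink_logits.view(1, h, 1, 1).expand(b, h, t, 1)
        probs = F.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
        # The sink column soaks up probability but attends to nothing.
        return probs[..., :-1] @ v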

The MXFP4 quantization detail might be the sleeper feature here. Getting 20B running on a 16 GB consumer card, or 120B on a single H100/MI300X without multi-GPU orchestration headaches, could be a bigger enabler for indie devs and researchers than raw benchmark deltas. A lot of experimentation never happens simply because the friction of getting the model loaded is too high.
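
Rough numbers on that (a back-of-the-envelope sketch, ignoring activations and the KV cache, and glossing over the fact that only some weights get quantized): MXFP4 stores 4-bit values plus a shared scale per 32-element block, so roughly 4.25 bits per parameter.

    # Back-of-the-envelope weight memory, in GB.
    def weight_gb(params_billion, bits_per_param):
        return params_billion * bits_per_param / 8

    for p in (20, 120):
        print(f"{p}B @ bf16 (16 bits):    ~{weight_gb(p, 16):.0f} GB")
        print(f"{p}B @ mxfp4 (4.25 bits): ~{weight_gb(p, 4.25):.1f} GB")
    # 20B:  ~40 GB -> ~10.6 GB (fits a 16 GB card, barely)
    # 120B: ~240 GB -> ~64 GB  (fits a single 80 GB H100)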

One open question I’m curious about: given gpt-oss’s design bias toward reasoning (and away from encyclopedic recall), will we start seeing a formal split in open-weight model development—specialized “reasoners” that rely on tool use for facts, and “knowledge bases” tuned for retrieval-heavy work? That separation could change how we architect systems that wrap these models.

7moritz7 | last Sunday at 4:09 PM

Qwen3 is substantially better in my local testing. As in, it adheres to the prompt better (pretty much exactly, for the 32B-parameter variant; very impressive) and sounds more organic.

In SimpleBench, gpt-oss (120B) flopped hard, so it doesn't appear particularly good at logic puzzles either.

So presumably, this comes down to...

- training technique or data

- dimension

- fewer large experts vs. many small experts (toy comparison below)
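
On that last point, toy numbers (entirely made up, not the real configs of either model) show the trade-off: two layouts can match in total expert parameters while activating very different amounts of compute per token.

    # Hypothetical MoE layouts with equal total capacity.
    def moe(n_experts, active_per_token, params_per_expert_b):
        total = n_experts * params_per_expert_b
        active = active_per_token * params_per_expert_b
        return total, active

    print(moe(32, 4, 3.0))    # few large experts:  (96.0, 12.0) -> 12B active
    print(moe(128, 8, 0.75))  # many small experts: (96.0, 6.0)  ->  6B active

Same total capacity, but the fewer-larger layout spends twice the compute per token, while the many-smaller layout allows finer-grained routing; that is a real axis on which the two models differ.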

mark_l_watson | last Sunday at 6:28 PM

Wow, Sebastian Raschka's blog articles are jewels - much appreciated.

I use the gpt-oss and qwen3 models a lot (smaller models locally via Ollama and LM Studio, commercial APIs for the full-size models).

For local model use, I get very good results with gpt-oss when I "over prompt," that is, I specify a larger amount of context information than I usually do. Qwen3 is simply awesome.

Until about three years ago, I had always understood neural network models well enough to write implementations (starting in the 1980s): GANs, recurrent networks, LSTMs, etc. I really miss the feeling that I could develop at least simpler LLMs on my own. I am slowly working through Sebastian Raschka's excellent book https://www.manning.com/books/build-a-large-language-model-f... but I will probably never finish it (to be honest).

roscas | last Sunday at 5:18 PM

From my experience, qwen3-coder is way better. I only keep gpt-oss:20b installed to run a few more tests, but when I gave each a program and asked for a summary of what it does, qwen3 just worked in a few seconds, while gpt-oss was cancelled after 5 minutes... doing nothing.

So I just use qwen3. Fast and great output. If for some reason I don't get what I need, I might use search engines or Perplexity.

I have a 10 GB 3080 and a Ryzen 3600X with 32 GB of RAM.

Qwen3-coder is amazing. The best I've used so far.

eurekin | last Sunday at 10:25 PM

I'm still in awe that a local 3090 GPU was able to run qwen3-coder-instruct 30b-a3b (exl3, q6) and...

It was able to create a sample page, try starting a server, recognise that a leftover server was already running, kill it (after forcing a prompt for my permission), retry, and find the IP for me to open in the browser.

This isn't a demo anymore. That's genuinely useful help for interns/juniors already.

Scene_Cast2 | last Sunday at 6:05 PM

I find it interesting that the architectures of modern open-weight LLMs are so similar, and that most innovation seems to be happening on the training (data, RL) front.

This is contrary to what I've seen in a large ML shop, where architectural tuning was king.

gglon | last Sunday at 8:34 PM

> At the time of writing, the highest-ranking non-purely-transformer-based model on the LM Arena is Jamba, which is a transformer–state space model hybrid, at rank 96.

Tencent's hunyuan-turbos, another hybrid, is currently ranked at 22. https://arxiv.org/abs/2505.15431

mike_hearn | last Monday at 12:48 PM

I'm really not a PyTorch expert, so this is most likely a newbie error, but could someone explain the code in Figure 7 to me?

The code circled as "4 x emb_dim" doesn't seem to apply a 4x multiplier anywhere. In fact, the layer definitions of fc1 and fc2 in the SwiGLU variant appear identical to the code in the regular feed-forward block. What makes the two layers in the second code snippet different sizes from fc1 in the first?
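
Not the article's exact code, but a minimal sketch of the two blocks such a figure typically contrasts may help: the multiplier lives in the constructor arguments, not in forward(). The regular block expands to 4 * emb_dim; the SwiGLU block projects to a hidden_dim set in the model config (often around 8/3 * emb_dim, so total parameters stay comparable despite the third matrix), which is why the forward code alone looks deceptively similar.

    import torch
    import torch.nn as nn

    class FeedForward(nn.Module):
        # GPT-2-style MLP: expand 4x, nonlinearity, project back.
        def __init__(self, emb_dim):
            super().__init__()
            self.fc1 = nn.Linear(emb_dim, 4 * emb_dim)   # the 4x is here
            self.fc2 = nn.Linear(4 * emb_dim, emb_dim)

        def forward(self, x):
            return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

    class SwiGLUFeedForward(nn.Module):
        # Two parallel projections, SiLU-gated, then projected back.
        def __init__(self, emb_dim, hidden_dim):
            super().__init__()
            self.fc1 = nn.Linear(emb_dim, hidden_dim)
            self.fc2 = nn.Linear(emb_dim, hidden_dim)
            self.fc3 = nn.Linear(hidden_dim, emb_dim)

        def forward(self, x):
            return self.fc3(torch.nn.functional.silu(self.fc1(x)) * self.fc2(x))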

dzogchen | last Monday at 8:41 AM

Say it with me: freely downloadable model weights do not mean a model is open source. https://opensource.org/ai/open-source-ai-definition

storus | last Sunday at 6:07 PM

In my tests, GPT-OSS-120B Q8 was close to DeepSeek R1 671B Q16 in solving graduate-level math but much faster with way fewer thinking tokens.

oezi | last Sunday at 8:45 PM

One question I was wondering about regarding the open models released by big labs is how much more they could improve with additional training. GPT-OSS got 2.1M hours of training; how much score improvement could we see at double that?

poorman | last Sunday at 8:59 PM

This article really goes into a lot of detail, which is nice. gpt-oss is just not good for agentic use, in my experience.

tl;dr: I'll save you a lot of time trying things out for yourself. If you are on a >=32 GB Mac, download LM Studio and then the `qwen3-coder-30b-a3b-instruct-mlx@5bit` model. It uses ~20 GB of RAM, so a 32 GB machine is plenty. Set it up with opencode [1] and you're off to the races! It has great tool-calling ability; gpt-oss doesn't even come close in my testing.

[1] https://opencode.ai/

ahmedfromtunis | last Sunday at 10:42 PM

When I visit the site I get the error "Your connection is not private". Also: "You cannot visit magazine.sebastianraschka.com right now because the website uses HSTS."

Chrome latest on Ubuntu.

pryelluw | last Sunday at 6:56 PM

The Qwen3 4B has been very good to use locally. I barely use the online models. Web searches are now more targeted thanks to it. I don't quite fully trust the output, but it's generally good. Models like these will revolutionize local knowledge work and automation.

chaos_emergent | last Sunday at 8:56 PM

> This is likely because LLMs are typically trained for only a single epoch over massive datasets, which is in contrast to the multi-hundred-epoch training regimes for which dropout was first introduced.

Wait, is this true? That seems like a wild statement to make, and relatively unsubstantiated.

homarp | last Sunday at 4:07 PM

"From GPT-2 to gpt-oss: Analyzing the Architectural Advances And How They Stack Up Against Qwen3"