Hacker News

wcallahan last Monday at 11:48 PM | 5 replies

I don’t do ‘evals’, but I do process billions of tokens every month, and I’ve found these small Nvidia models to be the best by far for their size currently.

As someone else mentioned, the GPT-OSS models are also quite good (though I haven't figured out how to make them great yet; I suspect they might age well like the Llama 3 models did and get better with time!).

But for a defined task, I’ve found task compliance, understanding, and tool call success rates to be some of the highest on these Nvidia models.

For example, I have a continuous job that evaluates whether the data for a startup on aVenture.vc may have conflated two similar but unrelated companies across news articles, research details, investment rounds, etc… which is a token-hungry ETL task! I recently retested this workflow on the top 15 or so current models with <125B parameters, and the Nvidia models were among the best performing for this kind of work, particularly around non-hallucination when given adequate grounding.
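For a sense of what that check looks like, here's a rough sketch against any OpenAI-compatible endpoint; the model id, prompt, and endpoint are illustrative, not my exact setup:

    from openai import OpenAI

    # Any OpenAI-compatible endpoint works (local vLLM, OpenRouter, a frontier provider).
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    def check_conflation(company: str, snippets: list[str]) -> str:
        """Ask the model whether any grounded snippet belongs to a different,
        similarly named company. Returns CONFLATED / CLEAN / UNKNOWN plus a reason."""
        grounding = "\n---\n".join(snippets)
        resp = client.chat.completions.create(
            model="nvidia/nemotron-nano-30b-a3b",  # placeholder model id
            temperature=0.0,
            messages=[
                {"role": "system", "content":
                    "Answer only from the provided snippets. "
                    "If the evidence is insufficient, answer UNKNOWN rather than guessing."},
                {"role": "user", "content":
                    f"Company: {company}\n\nSnippets:\n{grounding}\n\n"
                    "Do any of these snippets describe a different but similarly named company? "
                    "Reply CONFLATED, CLEAN, or UNKNOWN, then one sentence of justification."},
            ],
        )
        return resp.choices[0].message.content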

Also, re: cost - I run local inference on several machines continuously, in addition to routing through OpenRouter and the frontier providers, and was pleasantly surprised to find that, as long as I'm otherwise a paying OpenRouter customer, the free Nvidia variant there has quite generous limits, too.
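For anyone who hasn't tried it, hitting the free variant through OpenRouter is just the usual OpenAI-compatible call with a ":free" model slug; the slug below is illustrative, so check the OpenRouter catalog for the current one:

    from openai import OpenAI

    openrouter = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # your OpenRouter API key
    )

    resp = openrouter.chat.completions.create(
        model="nvidia/nemotron-nano-9b-v2:free",  # illustrative slug; ":free" selects the free tier where offered
        messages=[{"role": "user", "content": "Summarize this funding round announcement: ..."}],
    )
    print(resp.choices[0].message.content)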


Replies

selfhoster11 today at 9:40 AM

You may want to try the new "derestricted" variants of gpt-oss. While the ostensible goal of these variants is to de-censor them, it also removes the models' obsession with policy, which otherwise wastes thinking tokens that could go toward actually reasoning through the problem.

kgeist today at 12:19 AM

>the GPT-OSS models are also quite good

I recently pitted gpt-oss 120b against Qwen3-Next 80b on a lot of internal benchmarks (for production use), and for me, gpt-oss was slightly slower (vLLM, both fit in VRAM), much worse at multilingual tasks (33 languages evaluated), and worse at instruction following (e.g., Qwen3-Next could reuse the same prompts I wrote for Gemma3 perfectly, while gpt-oss struggled and RAG benchmark scores suddenly dropped from 90% to 60% without additional prompt engineering).

And that's with Qwen3-Next running as a random unofficial 4-bit quant (versus gpt-oss's native support), plus I had to disable multi-token prediction for Qwen3-Next because vLLM crashed with it enabled.
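For reference, the Qwen3-Next side of that setup looked roughly like this with vLLM's offline API (the repo name and parallelism here are illustrative, not my exact config); leaving out any speculative decoding config is what keeps multi-token prediction off:

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="someuser/Qwen3-Next-80B-A3B-Instruct-AWQ",  # placeholder community 4-bit quant
        quantization="awq",
        tensor_parallel_size=2,   # split across GPUs as needed
        # no speculative config passed, so the MTP draft layers simply aren't used
    )

    outputs = llm.generate(
        ["Answer in French: what does this company do? ..."],
        SamplingParams(temperature=0.0, max_tokens=256),
    )
    print(outputs[0].outputs[0].text)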

Has someone here tried both gpt-oss 120b and Qwen3-Next 80b? Maybe I was doing something wrong because I've seen a lot of people praise gpt-oss.

dandelionv1bes today at 8:59 AM

Completely agree. I was working on something with TensorRT LLM and threw Nemotron in there more on a whim. It completely mopped the floor with the other models on my task (text style transfer), based on joint moderation by another LLM and humans. Really impressed.

andy99 yesterday at 11:44 PM

What do you mean about not doing evals? Just literally that you don’t run any benchmarks or do you have something against them?

btown yesterday at 10:02 PM

Would you mind sharing what hardware/card(s) you're using? And is https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B... one of the ones you've tested?
