kgeist • last Monday at 2:34 AM

On my RTX 5090 with llama.cpp:

gpt-oss 120B - 37 tok/sec (with CPU offloading, doesn't fit in the GPU entirely)

Qwen3 32B - 65 tok/sec

Qwen3 30B-A3B - 150 tok/sec

(all at 4-bit quantization)
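
To put those numbers side by side, here is a rough sketch that turns them into ratios and wall-clock time. The tok/sec values are copied from the list above; the helper names and the 1000-token response length are just assumptions for illustration, not anything measured.

```python
# Throughput figures (tokens/sec) copied from the comment above; all at 4-bit.
reported_tok_per_sec = {
    "gpt-oss 120B (CPU offload)": 37,
    "Qwen3 32B": 65,
    "Qwen3 30B-A3B": 150,
}


def relative_speed(results, baseline):
    """Return each model's throughput as a multiple of the baseline model's."""
    base = results[baseline]
    return {name: rate / base for name, rate in results.items()}


def seconds_for(tokens, tok_per_sec):
    """Time to generate `tokens` tokens at a steady decode rate."""
    return tokens / tok_per_sec


if __name__ == "__main__":
    # Hypothetical response length, chosen only to make the rates concrete.
    response_tokens = 1000
    ratios = relative_speed(reported_tok_per_sec, "Qwen3 32B")
    for name, rate in reported_tok_per_sec.items():
        print(
            f"{name}: {rate} tok/sec, "
            f"{ratios[name]:.2f}x Qwen3 32B, "
            f"{seconds_for(response_tokens, rate):.1f} s per {response_tokens} tokens"
        )
```

This only covers steady-state generation speed; prompt processing time is not part of the figures above, so it is left out here too.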