
segmondy • today at 12:57 AM • 3 replies

Folks have more money than sense. gpt-oss-120b full quant runs on my quad 3090 at 100 tk/sec, and that's with llama.cpp. With vLLM it will probably run at 150 tk/sec, and that's without batching.

Replies

amarshall • today at 1:16 AM

You're almost certainly (definitely, in fact) confusing the 120b and 20b models.

Aurornis • today at 2:47 AM

> gpt-oss-120b full quant runs on my quad 3090

A 120B model cannot fit on 4 x 24GB GPUs at full quantization.

Either you're confusing this with the 20B model, or you have 48GB modded 3090s.
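
For anyone who wants to check the arithmetic behind this disagreement, here is a rough back-of-the-envelope sketch. It assumes a nominal 120e9 parameters (taken from the "120b" in the name, not an exact count) and counts only the weights, ignoring KV cache, activations, and runtime overhead. The bit widths are illustrative assumptions, since the thread never pins down what "full quant" means.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: nominal parameter count implied by the "120b" model name.
NUM_PARAMS = 120e9
# Four 24 GB cards, as in the quad-3090 setup discussed above.
TOTAL_VRAM_GB = 4 * 24

# Illustrative bit widths; which one "full quant" refers to is not stated.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    need = weight_memory_gb(NUM_PARAMS, bits)
    verdict = "fits" if need <= TOTAL_VRAM_GB else "does not fit"
    print(f"{label}: ~{need:.0f} GB of weights, {verdict} in {TOTAL_VRAM_GB} GB")
```

Under these assumptions only the 4-bit case lands under 96 GB, so whether the original claim is plausible depends mostly on how many bits per weight the "full" version actually uses.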

ericd • today at 1:29 AM

How're you fitting a model made for 80 gig cards onto a GPU with 24 gigs at full quant?
