Hacker News

andreybaskov today at 8:00 PM

Does anyone know or have a guess on the size of these latest thinking models and what hardware they use to run inference? As in how much memory and what quantization they use, and if it's "theoretically" possible to run them on something like a Mac Studio M3 Ultra with 512GB RAM. Just curious from a theoretical perspective.


Replies

threeducks today at 9:21 PM

Rough ballpark estimate:

- Amazon Bedrock serves Claude Opus 4.5 at 57.37 tokens per second: https://openrouter.ai/anthropic/claude-opus-4.5

- Amazon Bedrock serves gpt-oss-120b at 1748 tokens per second: https://openrouter.ai/openai/gpt-oss-120b

- gpt-oss-120b has 5.1B active parameters at approximately 4 bits per parameter: https://huggingface.co/openai/gpt-oss-120b

To generate one token, all active parameters must pass from memory to the processor (disregarding tricks like speculative decoding).

Multiplying 1748 tokens per second by 5.1B parameters at 4 bits per parameter gives us a memory bandwidth of 4457 GB/sec (probably more, since small models are more difficult to optimize).

If we divide the memory bandwidth by the 57.37 tokens per second for Claude Opus 4.5, we get about 80 GB of active parameters.
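The same arithmetic as a tiny Python sketch (same numbers as above; both figures are only as good as the assumption that decoding is purely memory-bandwidth bound):

    # Back-of-envelope: infer the active-parameter footprint from serving speed,
    # assuming decoding is memory-bandwidth bound (no speculative decoding).
    gpt_oss_tps = 1748        # tokens/s for gpt-oss-120b on Bedrock
    gpt_oss_active = 5.1e9    # active parameters
    gpt_oss_bits = 4          # ~4 bits per parameter

    bandwidth_gb_s = gpt_oss_tps * gpt_oss_active * gpt_oss_bits / 8 / 1e9
    print(f"implied bandwidth: {bandwidth_gb_s:.0f} GB/s")      # ~4457 GB/s

    opus_tps = 57.37          # tokens/s for Claude Opus 4.5 on Bedrock
    opus_active_gb = bandwidth_gb_s / opus_tps
    print(f"implied active weights: ~{opus_active_gb:.0f} GB")  # ~78 GB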

With speculative decoding, the numbers might change by maybe a factor of two or so. One could test this by measuring whether highly predictable text generates faster.

Of course, this does not tell us anything about the number of total parameters. The ratio of total parameters to active parameters can vary wildly from around 10 to over 30:

    120 : 5.1 for gpt-oss-120b
    30 : 3 for Qwen3-30B-A3B
    1000 : 32 for Kimi K2
    671 : 37 for DeepSeek V3
Even with the lower bound of 10, you'd have about 800 GB of total parameters, which does not fit into the 512 GB RAM of the M3 Ultra (you could chain multiple, at the cost of buying multiple).
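Spelled out, assuming the ~80 GB active-weight estimate holds and that the total:active ratio falls somewhere in the range of the open MoE models above (pure speculation about a closed model):

    # Extrapolate total size from ~80 GB of active weights, using the
    # total:active parameter ratios of open MoE models as a rough range.
    active_gb = 80
    ratios = {
        "gpt-oss-120b": 120 / 5.1,
        "Qwen3-30B-A3B": 30 / 3,
        "Kimi K2": 1000 / 32,
        "DeepSeek V3": 671 / 37,
    }
    for name, r in ratios.items():
        print(f"{name:>14}: ratio {r:5.1f} -> total ~{active_gb * r:,.0f} GB")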

But you can fit a 3-bit quantization of Kimi K2 Thinking, which is also a great model. HuggingFace has a nice table of quantization vs. required memory: https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF
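The rough arithmetic behind the 3-bit claim (weights only; actual GGUF files are somewhat larger since some layers stay at higher precision, and KV cache comes on top):

    # Kimi K2 Thinking: ~1T total parameters at ~3 bits per weight
    total_params = 1e12
    bits_per_param = 3
    weights_gb = total_params * bits_per_param / 8 / 1e9
    print(f"~{weights_gb:.0f} GB of weights")   # ~375 GB, under the M3 Ultra's 512 GB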

docjay today at 9:00 PM

That all depends on what you consider to be reasonably running it. Huge RAM isn’t required to run them; it just makes them faster. I imagine technically all you'd need is a few hundred megabytes for the framework and housekeeping, but you’d have to wait for some/most/all of the model to be read off the disk for each token it processes.

None of the closed providers talk about size, but for a reference point of the scale: Kimi K2 Thinking can spar in the big leagues with GPT-5 and such…if you compare benchmarks that use words and phrasing with very little in common with how people actually interact with them…and at FP16 you’ll need 2.9TB of memory @ 256,000 context. It seems it was recently retrained at INT4 (not just quantized, apparently) and now:

“The smallest deployment unit for Kimi-K2-Thinking INT4 weights with 256k seqlen on mainstream H200 platform is a cluster with 8 GPUs with Tensor Parallel (TP).” (https://huggingface.co/moonshotai/Kimi-K2-Thinking)

-or-

“62× RTX 4090 (24GB) or 16× H100 (80GB) or 13× M3 Max (128GB)”

So ~1.1TB. Of course it can be quantized down to as dumb as you can stand, even within ~250GB (https://docs.unsloth.ai/models/kimi-k2-thinking-how-to-run-l...).

But again, that’s for speed. You can run them more-or-less straight off the disk, but (~1TB / SSD_read_speed + computation_time_per_chunk_in_RAM) = a few minutes per ~word or punctuation.
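A rough sketch of that estimate (both inputs are assumptions: ~1.1 TB of weights read per token and a fast NVMe SSD; OS caching and the MoE routing pattern can change this a lot):

    # If the weights have to stream from SSD for every token, read time dominates.
    model_gb = 1100          # ~1.1 TB of weights
    ssd_gb_per_s = 7         # fast NVMe sequential read speed
    seconds_per_token = model_gb / ssd_gb_per_s
    print(f"~{seconds_per_token / 60:.1f} minutes per token")   # ~2.6 minutes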
