Rough ballpark estimate:
- Amazon Bedrock serves Claude Opus 4.5 at 57.37 tokens per second: https://openrouter.ai/anthropic/claude-opus-4.5
- Amazon Bedrock serves gpt-oss-120b at 1748 tokens per second: https://openrouter.ai/openai/gpt-oss-120b
- gpt-oss-120b has 5.1B active parameters at approximately 4 bits per parameter: https://huggingface.co/openai/gpt-oss-120b
To generate one token, every active parameter must pass from memory to the processor (disregarding tricks like speculative decoding).
Multiplying 1748 tokens per second by 5.1B parameters at 4 bits (0.5 bytes) per parameter gives a memory bandwidth of about 4457 GB/s (probably more, since small models are harder to optimize).
Dividing that bandwidth by the 57.37 tokens per second for Claude Opus 4.5 gives about 80 GB of active parameter weights.
With speculative decoding, the numbers might change by maybe a factor of two or so. One could test this by measuring whether it is faster to generate predictable text.
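For anyone who wants to redo the arithmetic, here is the estimate above as a few lines of Python. The throughput and parameter figures are just the OpenRouter/Hugging Face numbers quoted above; the big assumption is that both models are served from hardware with comparable memory bandwidth:

    GPT_OSS_TPS = 1748        # tokens/s, gpt-oss-120b on Bedrock (OpenRouter figure)
    GPT_OSS_ACTIVE = 5.1e9    # active parameters
    GPT_OSS_BITS = 4          # MXFP4, roughly 4 bits per parameter

    OPUS_45_TPS = 57.37       # tokens/s, Claude Opus 4.5 on Bedrock (OpenRouter figure)

    # Bytes that must stream from memory for every generated token of gpt-oss-120b
    bytes_per_token = GPT_OSS_ACTIVE * GPT_OSS_BITS / 8     # ~2.55 GB

    # Implied memory bandwidth of the serving hardware (a lower bound)
    bandwidth_bps = GPT_OSS_TPS * bytes_per_token            # ~4.46e12 bytes/s

    # If Opus 4.5 runs on comparable hardware, this is its active-weight footprint
    opus_active_bytes = bandwidth_bps / OPUS_45_TPS          # ~7.8e10 bytes

    print(f"implied bandwidth       ~ {bandwidth_bps / 1e9:.0f} GB/s")
    print(f"Opus 4.5 active weights ~ {opus_active_bytes / 1e9:.0f} GB")

This prints roughly 4457 GB/s and 78 GB, matching the numbers above.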
Of course, this does not tell us anything about the number of total parameters. The ratio of total parameters to active parameters can vary wildly from around 10 to over 30:
- 120 : 5.1 for gpt-oss-120b
- 30 : 3 for Qwen3-30B-A3B
- 1000 : 32 for Kimi K2
- 671 : 37 for DeepSeek V3
Even with the lower bound of 10, you'd have about 800 GB of total parameters, which does not fit into the 512 GB RAM of an M3 Ultra (you could chain multiple machines, at the cost of buying multiple). But you can fit a 3-bit quantization of Kimi K2 Thinking, which is also a great model. HuggingFace has a nice table of quantization level vs. required memory: https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF
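Continuing the sketch, here is what the ~80 GB active-weight estimate turns into under the total:active ratios listed above (the ratios come from those public MoE models; applying any of them to Opus 4.5 is pure speculation):

    ACTIVE_GB = 80            # active-weight estimate from above (at serving precision)
    M3_ULTRA_RAM_GB = 512     # maximum unified memory of an M3 Ultra

    ratios = {
        "Qwen3-30B-A3B": 30 / 3,
        "DeepSeek V3":   671 / 37,
        "gpt-oss-120b":  120 / 5.1,
        "Kimi K2":       1000 / 32,
    }

    for name, ratio in ratios.items():
        total_gb = ACTIVE_GB * ratio
        verdict = "fits" if total_gb <= M3_ULTRA_RAM_GB else "does not fit"
        print(f"{name:14s} ratio {ratio:5.1f} -> ~{total_gb:6.0f} GB total, {verdict} in 512 GB")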
I love logical posts like this. There are other factors like MXFP4 in gpt-oss, MLA in DeepSeek, etc.
>Amazon Bedrock serves Claude Opus 4.5 at 57.37
I checked the other Opus 4 models on Bedrock:
Opus 4 - 18.56 tps
Opus 4.1 - 19.34 tps
So they apparently changed the active parameter count with Opus 4.5: roughly 3x the throughput would mean roughly a third of the active parameter bytes, on the same-hardware assumption.
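Plugging those throughputs into the same bandwidth estimate from the parent comment (again assuming all three models are served from comparable hardware) gives:

    BANDWIDTH_GBPS = 4457     # estimated from gpt-oss-120b above; assumed constant across models

    for name, tps in [("Opus 4", 18.56), ("Opus 4.1", 19.34), ("Opus 4.5", 57.37)]:
        print(f"{name:9s} {tps:6.2f} tok/s -> ~{BANDWIDTH_GBPS / tps:4.0f} GB of active weights")

That works out to roughly 240 GB and 230 GB for Opus 4 and 4.1 versus ~78 GB for Opus 4.5.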