It does make sense. Nvidia chips do not promise 1,000+ tokens/s. The 80GB is external HBM, unlike Cerebras’ 44GB of on-chip SRAM.
The whole reason Cerebras can serve a model at thousands of tokens per second is that it hosts the entire model in SRAM.
There are two possible scenarios for Codex Spark:
1. OpenAI designed a model to fit within a single wafer’s 44GB.
2. OpenAI designed a model that requires Cerebras to chain multiple wafers together, i.e., an 88GB, 132GB, or 176GB model, or larger.
Both options require the entire model to fit inside SRAM.
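As a rough illustration (not based on anything OpenAI or Cerebras have published), here is the back-of-envelope arithmetic for how many parameters fit in a given SRAM budget at common weight precisions; the 44GB-per-wafer figure and the precision choices are the only inputs, and it ignores the room needed for activations and cache.

```python
# Back-of-envelope: how many parameters fit in N wafers of 44 GB SRAM?
# Illustrative only; assumes all SRAM is available for weights, which it isn't
# (activations, KV cache, and runtime buffers also need room).

SRAM_PER_WAFER_GB = 44

BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "fp8/int8": 1.0,
    "4-bit": 0.5,
}

for wafers in (1, 2, 3, 4):
    budget_bytes = wafers * SRAM_PER_WAFER_GB * 1e9
    sizes = ", ".join(
        f"{name}: ~{budget_bytes / nbytes / 1e9:.0f}B params"
        for name, nbytes in BYTES_PER_PARAM.items()
    )
    print(f"{wafers} wafer(s) = {wafers * SRAM_PER_WAFER_GB} GB -> {sizes}")
```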
Let's not forget the KV cache, which also needs a significant amount of memory (though not as much as the model weights) and scales linearly with sequence length.
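To make that linear scaling concrete, here is a sketch of the standard KV-cache sizing formula; the layer count, KV-head count, head dimension, and cache precision below are hypothetical placeholders, not Codex Spark's actual architecture.

```python
# KV-cache size per token: 2 (K and V) x layers x kv_heads x head_dim x bytes.
# All architecture numbers below are made-up placeholders for illustration.

N_LAYERS = 64
N_KV_HEADS = 8        # grouped-query attention keeps this small
HEAD_DIM = 128
BYTES_PER_ELEM = 2    # fp16/bf16 cache

def kv_cache_bytes(seq_len: int) -> int:
    """Total KV-cache footprint for a single sequence of the given length."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return per_token * seq_len

for seq_len in (8_192, 32_768, 131_072):
    gb = kv_cache_bytes(seq_len) / 1e9
    print(f"{seq_len:>7} tokens -> ~{gb:.1f} GB of KV cache")
```

With these placeholder numbers the cache works out to roughly 0.26 MB per token, so a long context eats into the same SRAM budget that the weights have to live in.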