Hacker News

aurareturn · today at 11:58 AM

It does make sense. Nvidia chips do not promise 1,000+ tokens/s. Their 80GB is external HBM, unlike Cerebras' 44GB of on-wafer SRAM.

The whole reason Cerebras can run inference at thousands of tokens per second is that it holds the entire model in SRAM.
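A rough roofline sketch of why that matters: for memory-bound decoding, tokens/s is capped by memory bandwidth divided by the bytes of weights read per generated token. The bandwidth figures and the 40GB weight footprint below are my own illustrative assumptions, not vendor specs or anything tied to Codex Spark.

```python
# Memory-bound token-rate estimate: tokens/s <= bandwidth / bytes_read_per_token.
# All numbers below are illustrative assumptions for a single stream at batch 1.

def max_tokens_per_s(bandwidth_bytes_per_s: float, weight_bytes: float) -> float:
    return bandwidth_bytes_per_s / weight_bytes

WEIGHTS = 40e9      # assume ~40 GB of weights touched per token
HBM     = 3.35e12   # ~3.35 TB/s, roughly HBM3 on an 80GB-class GPU
SRAM    = 21e15     # ~21 PB/s aggregate on-wafer SRAM, Cerebras-class (assumed)

print(f"HBM-bound:  ~{max_tokens_per_s(HBM, WEIGHTS):,.0f} tok/s")
print(f"SRAM-bound: ~{max_tokens_per_s(SRAM, WEIGHTS):,.0f} tok/s")
```

With those assumptions the HBM-bound ceiling is on the order of tens of tokens/s per stream, while the SRAM-bound ceiling is orders of magnitude higher, which is the gap the comment is pointing at.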

There are two possible scenarios for Codex Spark:

1. OpenAI designed a model to fit within a single wafer's 44GB.

2. OpenAI designed a model that requires Cerebras to chain multiple wafers together, i.e., an 88GB, 132GB, or 176GB model or larger.

Both options require the entire model to fit inside SRAM.
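A quick back-of-the-envelope check of the two scenarios above: how many 44GB wafers a model needs if every weight has to live in on-wafer SRAM. The parameter counts and quantization widths are placeholder assumptions, not claims about the actual Codex Spark model.

```python
import math

SRAM_PER_WAFER_GB = 44  # per-wafer SRAM capacity cited in the comment

def wafers_needed(params_billion: float, bytes_per_param: float) -> int:
    """Ceiling of model size (weights only) over per-wafer SRAM."""
    model_gb = params_billion * bytes_per_param  # 1e9 params * bytes/param ~ GB
    return math.ceil(model_gb / SRAM_PER_WAFER_GB)

# Hypothetical model sizes and precisions, purely for illustration.
for params, bits in [(20, 16), (70, 8), (120, 4)]:
    n = wafers_needed(params, bits / 8)
    print(f"{params}B params @ {bits}-bit -> {n} wafer(s), "
          f"{n * SRAM_PER_WAFER_GB} GB SRAM available")
```

Under these assumptions, a ~20B model at 16-bit squeezes onto one wafer, while larger or less aggressively quantized models push into the multi-wafer (88GB+) territory of scenario 2.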


Replies

woadwarrior01 · today at 1:14 PM

Let's not forget the KV cache, which also needs a lot of memory (although not as much as the model weights) and scales linearly with sequence length.
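A small sketch of that linear growth. The layer count, KV head count, head dimension, and dtype are placeholder assumptions for a mid-sized transformer, not any specific model's config.

```python
# KV-cache footprint per sequence: 2 (K and V) * layers * kv_heads * head_dim
# * bytes_per_element * sequence_length. All model dimensions are assumed.

def kv_cache_bytes(seq_len: int, n_layers: int = 60, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

for tokens in (8_192, 32_768, 131_072):
    print(f"{tokens:>7} tokens -> {kv_cache_bytes(tokens) / 1e9:.1f} GB per sequence")
```

Even with these modest assumptions the cache grows from a couple of GB at 8K tokens to tens of GB at 128K, so it eats meaningfully into a fixed SRAM budget on top of the weights.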