
acoustics · yesterday at 6:09 PM

The number of tokens a model is trained on is separate from the model's size.

Gemma 3 270M was trained on 6 trillion tokens but can be loaded into a few hundred million bytes of memory.
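For intuition, the weights' in-memory footprint is roughly parameter count times bytes per parameter; the training-token count never enters that math. A minimal back-of-the-envelope sketch (the `approx_model_size_bytes` helper is just for illustration, and it ignores activations, KV cache, and runtime overhead):

```python
def approx_model_size_bytes(n_params: float, bytes_per_param: float) -> float:
    """Rough in-memory size of a model's weights only."""
    return n_params * bytes_per_param

# Gemma 3 270M: ~270 million parameters, at various precisions.
for label, bpp in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    size_mb = approx_model_size_bytes(270e6, bpp) / 1e6
    print(f"{label}: ~{size_mb:.0f} MB")
# fp16 gives ~540 MB and int8 ~270 MB: "a few hundred million bytes",
# regardless of the 6 trillion tokens it was trained on.
```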

But yeah GPT-4 is certainly way bigger than 45GB.