> Directions we think are wide open
> Second-order optimizers and natural gradient methods
Do second-order optimizers help improve data efficiency? I assumed they'd help you reach the same minimum faster (though this is well outside my wheelhouse).
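For intuition on why they might do more than "get there faster": a toy sketch (not from the benchmark, all names illustrative) comparing plain SGD against a diagonally preconditioned step, the simplified idea behind diagonal-Fisher / natural-gradient methods. On ill-conditioned curvature, the same number of gradient evaluations extracts far more progress per step.

```python
import numpy as np

# Toy quadratic with ill-conditioned curvature (hypothetical example,
# not the challenge's training code).
H = np.diag([100.0, 1.0])          # stiff direction vs. flat direction
w0 = np.array([1.0, 1.0])          # optimum is at the origin

def loss(w):
    return 0.5 * w @ H @ w

def grad(w):
    return H @ w

# Plain SGD: one learning rate for all directions, capped by the
# stiff direction's stability limit.
w_sgd = w0.copy()
for _ in range(50):
    w_sgd -= 0.009 * grad(w_sgd)

# Preconditioned step: scale each coordinate by inverse curvature
# (the diagonal-Fisher idea). Same step count, same gradient calls.
w_ng = w0.copy()
precond = 1.0 / np.diag(H)
for _ in range(50):
    w_ng -= 0.9 * precond * grad(w_ng)

print(loss(w_sgd), loss(w_ng))     # preconditioned run converges far further
```

Whether this translates into better *data* efficiency (more signal per token, not just per second) is exactly the open question.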
There was a very interesting paper out of Stanford this past September on pretraining under the unlimited-compute, limited-data paradigm[0]. Pretty much the same idea, but with ~200M training tokens instead.
Curious about the baseline choice. modded-nanogpt was optimized for wall-clock speed, not data efficiency, so it seems like an unusual reference point for this kind of benchmark. Why not vanilla NanoGPT?
Very cool idea. Interested to see how this progresses. One question: how worried are you about over-fitting to this particular dataset, i.e. drifting toward memorization rather than generalization? You do hold out a validation set, but since you're meta-optimizing the training setup against its performance on that validation set, you're still at risk of over-fitting to it.
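The usual mitigation is a nested split: tune against val, then report on a test split that is only ever touched once. A minimal sketch (all names and data here are made up, just to illustrate the mechanics):

```python
import numpy as np

# Illustrative three-way split: train for fitting, val for
# meta-optimization, test untouched until the final report.
rng = np.random.default_rng(0)
data = rng.normal(size=1000)
train, val, test = data[:600], data[600:800], data[800:]

def score(mu, split):
    # Toy "model quality" metric; higher is better.
    return -np.mean((split - mu) ** 2)

# "Meta-optimize" a tunable quantity against the val split...
candidates = np.linspace(-1.0, 1.0, 201)
best = max(candidates, key=lambda mu: score(mu, val))

# ...then the val-vs-test gap estimates how much of the val score
# was just fitting the val split itself.
gap = score(best, val) - score(best, test)
print(best, gap)
```

With enough meta-optimization rounds against val, the gap grows; tracking it is how you'd catch the over-fitting you're describing.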
I like the idea of flipping the constraint. Most ML benchmarks assume unlimited data and limited compute, so people optimize for speed.
If high-quality training data becomes the real bottleneck, then the interesting question is how much signal you can extract from the same dataset when compute is cheap.
Amazing job!
This looks awesome!!! I'm curious about the ensemble: does it mean "train 8 different models and pick the best one"? That's where my mind jumps, but it also seems wrong, because then you could presumably just keep increasing the number of models you train to get a win.
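For what it's worth, "ensemble" in this kind of setting usually means averaging the members' predicted distributions rather than picking a single winner, which is also why "just add more models" eventually stops paying off: the averaging benefit saturates as members become correlated. A hedged sketch (shapes and names are illustrative, not from the challenge code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, vocab = 8, 50

# Fake per-model logits for one next-token prediction.
logits = rng.normal(size=(n_models, vocab))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

probs = softmax(logits)           # (8, vocab): one distribution per model
ensemble = probs.mean(axis=0)     # average the distributions, not pick-best

print(ensemble.sum())             # still a valid probability distribution
```

So the gain comes from the members disagreeing in useful ways, not from any one of them being best.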
Reminds me a fair bit of the BabyLM challenge. It would be good to give them a shout-out and see how this challenge differs.
The ensemble diversity point is underrated. Most teams pick one architecture and ship it, so the finding that architectural variation beats random seeds is interesting but hard to act on in practice. The more useful takeaway: low-data regimes expose every bad design decision you normally paper over with more tokens. It's basically a forcing function for understanding what actually drives model quality vs. what's just scale noise.