Hacker News

NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute

128 points | by sdpmas, yesterday at 5:56 PM | 26 comments

Comments

shubhamintech, today at 2:08 AM

The ensemble diversity point is underrated. Most teams pick one architecture and ship it, so the finding that architectural variation beats random seeds is interesting but hard to act on in practice. The more useful takeaway: low-data regimes expose every bad design decision you normally paper over with more tokens. It's basically a forcing function for understanding what actually drives model quality vs. what's just scale noise.

bee_rider, yesterday at 11:55 PM

> Directions we think are wide open

> Second-order optimizers and natural gradient methods

Do second-order optimizers help improve data efficiency? I assumed they'd just get you to the same minimum faster (but this is way outside my wheelhouse).
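The commenter's intuition ("get to the same minimum faster") can be illustrated on a toy quadratic, where a Newton step rescales the gradient by the inverse curvature and lands on the minimum in one update. Whether that speed-up translates into data efficiency is the open question; this is just a 1-D sketch with illustrative names, not the project's method.

```python
def loss(w):
    # f(w) = 2*(w - 3)^2, minimum at w = 3
    return 2.0 * (w - 3.0) ** 2

def grad(w):
    # f'(w) = 4*(w - 3)
    return 4.0 * (w - 3.0)

hessian = 4.0  # f''(w) is constant for a quadratic

# One Newton step from w = 0: divide the gradient by the curvature.
w_newton = 0.0 - grad(0.0) / hessian  # lands exactly on 3.0

# Plain gradient descent with lr = 0.1 is still approaching after 10 steps.
w_gd = 0.0
for _ in range(10):
    w_gd -= 0.1 * grad(w_gd)

print(w_newton, w_gd)  # Newton hits 3.0 exactly; GD is still short of it
```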

linolevan, yesterday at 10:23 PM

There was a very interesting paper out of Stanford this past September about pretraining under the unlimited-compute, limited-data paradigm[0]. Pretty much exactly the same setup, but with ~200M training tokens instead.

[0] https://www.alphaxiv.org/abs/2509.14786

kseniamorph, yesterday at 9:08 PM

Curious about the baseline choice. modded-nanogpt was optimized for wall-clock speed, not data efficiency, so it seems like an unusual reference point for this kind of benchmark. Why not vanilla NanoGPT?

archermarks, yesterday at 7:23 PM

Very cool idea. Interested to see how this progresses. One question: how worried are you about over-fitting to this particular dataset, i.e. leaning toward memorization instead of generalization? Obviously you hold out a validation set, but since you're meta-optimizing the model itself by its performance on that validation set, you're still at risk of over-fitting to it.
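The risk the comment describes can be made concrete with a toy sketch: once you select a configuration by its validation loss, that loss is an optimistically biased estimate, so an honest report needs a second held-out split that the search never sees. Everything below is illustrative; the fake losses stand in for real training runs.

```python
import random

random.seed(0)

# Toy "dataset": indices standing in for documents.
data = list(range(1000))
random.shuffle(data)
train, val, test = data[:800], data[800:900], data[900:]

def train_and_eval(config, train_split, eval_split):
    # Stand-in for real training + evaluation: returns a deterministic
    # fake loss that depends on the config and the evaluation split.
    rng = random.Random(hash((config, tuple(eval_split[:5]))))
    return rng.uniform(1.0, 2.0)

# Meta-optimization loop: pick the config with the best *validation* loss.
configs = range(50)
best = min(configs, key=lambda c: train_and_eval(c, train, val))

# The validation number is biased by the search itself; the untouched
# test split gives the unbiased estimate to report.
val_loss = train_and_eval(best, train, val)
test_loss = train_and_eval(best, train, test)
print(f"val={val_loss:.3f} test={test_loss:.3f}")
```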

lzaborowski, yesterday at 7:52 PM

I like the idea of flipping the constraint. Most ML benchmarks assume unlimited data and limited compute, so people optimize for speed.

If high-quality training data becomes the real bottleneck, then the interesting question is how much signal you can extract from the same dataset when compute is cheap.

navvyeanand, yesterday at 7:59 PM

Amazing job!

refulgentis, yesterday at 11:31 PM

This looks awesome!!! I'm curious about the ensemble: does it mean "train 8 different models and pick the best one"? That's where my mind jumps, but it also seems wrong, because then you could just keep increasing the number of models you train to get a win.
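One common reading of "ensemble" in language-model evaluation (an assumption here, not necessarily what the post means) is not picking the best member but averaging the per-token probabilities of several independently trained models, which tends to show diminishing returns per added member rather than a free win. A minimal pure-Python sketch:

```python
import math
import random

random.seed(0)
vocab = 8

# Stand-ins for next-token logits from 3 separately trained models.
model_logits = [[random.gauss(0, 1) for _ in range(vocab)] for _ in range(3)]

def softmax(logits):
    # Numerically stable softmax over one logit vector.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Ensemble prediction: average the probability each member assigns to
# every token; the result is still a valid distribution over the vocab.
member_probs = [softmax(l) for l in model_logits]
ensemble = [sum(p[i] for p in member_probs) / len(member_probs)
            for i in range(vocab)]

assert abs(sum(ensemble) - 1.0) < 1e-9
```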

suddenlybananas, yesterday at 6:43 PM

Reminds me a fair bit of the BabyLM challenge. It would be good to give them a shout-out and see how this challenge differs.


