
vessenes, today at 12:19 AM

So, this is either the paper of the year, or ... definitely not the paper of the year.

https://arxiv.org/pdf/2605.01172 is the current version. The money graphs are page 8 and on where they show (some weirdly thick) line charts with loss results reached in roughly 1/5 the number of steps that Adam takes, just what the blog post mentions.

They also claim holding back test data is not needed, also with more graphs.

I'm not an ML scientist, and I did not attempt to seriously parse the math. To me it sits precisely in that liminal space some math papers occupy, where there's enough new terminology that actually parsing through it all is going to take real, concerted effort, possibly with mild brain damage as a risk.

Their 3D graphs of "kernel eigenstructure" also do double duty for me as totally impenetrable and possibly part of an April Fools' ML paper that's hilarious to insiders. Or maybe they show something really amazing; they definitely seem to converge into a shape... What does that shape mean??? Why??? What is an eigenstructure? Is it just 3D eigenvectors of some matrices? Is it natural to have a 3D shape representing these large matrices? If not, how and why were these projected down? And why are they different colors in the paper?? You get a feel for my level of understanding.

I think it would frankly just be easier to validate this claim than parse the whole paper. If only I could understand

  > Each one-step kernel increment ηKMtSS integrates into WMS, so a sequence of one-step rate-maximizers is the greedy policy whose integral is the signal-channel content of the trajectory through G, exactly as plain SGD is the greedy step whose integral is empirical-risk descent through D. The diagonal cutoff μ_k^2 > σ_k^2/(b−1) is the optimal first-order preconditioner for population risk on any diagonal base, and a streaming variance EMA ŝ_t of squared gradient deviations realizes it as a one-line change to AdamW: one extra parameter-sized state vector and a per-parameter gate that multiplies the standard moment update
well enough to implement the one-line update to Adam in Python. I have not asked Codex or Claude to assist yet.
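
For anyone who wants a starting point, here is a minimal NumPy sketch of one possible reading of that quoted passage. It is a guess, not the paper's code. The assumptions: Adam's bias-corrected first moment stands in for the per-parameter mean μ_k; the extra state vector `s` is an EMA of squared deviations of the gradient from that mean and stands in for σ_k^2; `batch_size` is the b in the cutoff; and the gate multiplies the step computed from the moments. The function names, the `bs` decay rate, and where exactly the gate is applied are all hypothetical choices of this sketch.

```python
import numpy as np

def init_state(p):
    """Adam's two moment vectors plus one extra parameter-sized vector `s`."""
    return {"t": 0, "m": np.zeros_like(p), "v": np.zeros_like(p),
            "s": np.zeros_like(p)}

def gated_adamw_step(p, g, state, lr=1e-3, b1=0.9, b2=0.999, bs=0.99,
                     eps=1e-8, wd=0.0, batch_size=32):
    """One AdamW-style step with a hypothetical per-parameter signal gate.

    Returns the updated parameters and the boolean gate used for this step.
    """
    state["t"] += 1
    t = state["t"]

    # Standard Adam moment updates with bias correction.
    state["m"] = b1 * state["m"] + (1 - b1) * g
    state["v"] = b2 * state["v"] + (1 - b2) * g * g
    m_hat = state["m"] / (1 - b1 ** t)
    v_hat = state["v"] / (1 - b2 ** t)

    # ASSUMPTION: streaming variance = EMA of squared deviations of the
    # current gradient from the running mean estimate m_hat.
    dev = g - m_hat
    state["s"] = bs * state["s"] + (1 - bs) * dev * dev
    s_hat = state["s"] / (1 - bs ** t)

    # ASSUMPTION: the quoted cutoff mu_k^2 > sigma_k^2 / (b - 1),
    # evaluated per parameter with m_hat as mu_k and s_hat as sigma_k^2.
    gate = m_hat ** 2 > s_hat / (batch_size - 1)

    # ASSUMPTION: the gate multiplies the Adam step; weight decay is
    # decoupled, as in AdamW.
    step = gate * m_hat / (np.sqrt(v_hat) + eps)
    return p - lr * (step + wd * p), gate

# Toy usage: coordinate 0 sees a constant gradient (pure signal),
# coordinate 1 sees a sign-flipping gradient (pure noise).
p = np.zeros(2)
state = init_state(p)
for t in range(200):
    g = np.array([1.0, (-1.0) ** t])
    p, gate = gated_adamw_step(p, g, state, lr=1e-2)
```

In the toy loop the gate is the only thing that differs from a plain AdamW step, so comparing runs with the gate forced to all-ones against the gated version on a small model would be the obvious first check of the paper's claim.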

Also of note to me, they talk about grokking which I found SUUUPER fascinating when it was first reported, and have never heard about since. So I was really glad to read about it and read that there has been a little academic work on the phenomenon.

Finally, of the three models they report results on, two are extremely tiny, and the last is a DPO round on Qwen 0.5B. If the code for that is published, I imagine it would be easy to adapt and evaluate in other regimes.