Hacker News

achierius · yesterday at 7:34 PM

Out of curiosity, what sort of things have you seen it do so far that better fit 'autoresearch' than 'autotune'? Optimizations it made that wouldn't have been surfaced by an autotune system, I suppose.


Replies

karpathy · yesterday at 8:48 PM

The most recent round of autoresearch (round 2), which decreased "time to GPT-2" from 1.8 hours to 1.65 hours, had some examples. I adjusted the program.md to "look at the modded-nanogpt project and draw inspiration from there for things to try" and it came back with a bunch of tuning, but it also tried and implemented new architecture changes, some of which actually helped, including the smear gate and the backout skip connection. These are not just hyperparameters, they are new PyTorch code. I'm now working on a more general system that can have a queue of ideas sourced from arXiv papers, GitHub repos, etc.
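
(For a concrete sense of what such a change looks like, here is a minimal PyTorch sketch of a smear-gate-style module, assuming the idea is to mix a learned, gated fraction of the previous token's representation into the current one. The module name, shapes, and gating details are illustrative assumptions, not the actual nanochat/autoresearch code.)

    import torch
    import torch.nn as nn

    class SmearGate(nn.Module):
        # Illustrative sketch only: gate in a fraction of the previous
        # token's representation. Not the actual autoresearch output.
        def __init__(self, dim: int):
            super().__init__()
            self.gate = nn.Linear(dim, 1)  # per-token scalar gate

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, dim); previous-token reps, zero at position 0
            prev = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
            g = torch.sigmoid(self.gate(x))  # (batch, seq_len, 1)
            return x + g * prev

The point is that this kind of change touches module definitions and the forward pass, not just a config file, which is why a hyperparameter sweep would never find it.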

jwilber · yesterday at 9:40 PM

I see this critique of autoresearch online often, but I think it’s misplaced.

Here’s a use case that may illuminate the difference, from my own work at Nvidia. I’m currently training some large sparse autoencoders, and there are issues with dead latents. Several solutions exist to help here, such as AuxK, which I can certainly include, tuning the relevant params as you describe. However, I have several other ideas that are quite different, each of which requires editing core code (full evaluation changes, initialization strategies, architecture changes, etc.), including changes to parallelism strategies in the multi-rank environment I’m using. Moreover, drawing on my ideas and the existing literature, Claude can try a number of new approaches, each potentially involving further code changes.
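
(For reference, here is a minimal sketch of an AuxK-style auxiliary loss, in the spirit of OpenAI's "Scaling and evaluating sparse autoencoders": the top-k_aux dead latents are asked to reconstruct the residual of the main reconstruction, so they keep receiving gradient. All names, shapes, and the dead-latent bookkeeping are illustrative assumptions, not my actual training code.)

    import torch
    import torch.nn.functional as F

    def auxk_loss(residual, pre_acts, dead_mask, decoder_weight, k_aux=256):
        # Illustrative AuxK sketch. Assumed shapes:
        #   residual:       (batch, d_model)   x - x_hat, typically detached
        #   pre_acts:       (batch, n_latents) pre-activation latent values
        #   dead_mask:      (n_latents,) bool, True where a latent is "dead"
        #   decoder_weight: (d_model, n_latents)
        masked = pre_acts.masked_fill(~dead_mask, float('-inf'))
        k = min(k_aux, int(dead_mask.sum()))
        if k == 0:
            return residual.new_zeros(())
        vals, idx = masked.topk(k, dim=-1)  # top-k among dead latents only
        aux_acts = torch.zeros_like(pre_acts).scatter_(-1, idx, F.relu(vals))
        aux_recon = aux_acts @ decoder_weight.T  # (batch, d_model)
        return F.mse_loss(aux_recon, residual)

In typical use this term is added to the main reconstruction loss with a small coefficient (the paper uses something like 1/32), which is exactly the kind of knob an autotuner could sweep; the ideas I mean go well beyond that.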

This automated run-and-discover process is far beyond what’s possible with hyperparam search.
