logoalt Hacker News

why_only_15yesterday at 11:41 PM0 repliesview on HN

I think Cosmo's refutations were mostly not very useful and based on misunderstandings of what I was trying to say. This is fine and we discussed it prior to their article being published.

The point I was trying to make with "RL is only necessary once" is that you can embark on a single self-play loop getting better and better, and this will get you to something close to the frontier. Once you're at the frontier, the frontier doesn't move very much, so you have quite a while (decade?) where it's totally fine to distill from the RL games.

On correction histories -- imo I correctly described what they do. Cosmo was annoyed by the word "adapt" but what I described was the adaptation.

On SPSA -- you don't have a gradient! you don't do backprop! this is what i was trying to get at.