Hacker News

yorwba, yesterday at 4:42 PM

You also need a base model that can satisfy the verifier at least some of the time. If all attempts fail, there's nothing there to reinforce. The reinforcement-learning algorithms themselves haven't changed much, but LLMs got good enough on many problems that RL could be applied. So for any given class of problem you still need enough human data to get initial performance better than random.
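
To make the "nothing there to reinforce" point concrete, here is a minimal sketch. It assumes a binary verifier reward (1 for pass, 0 for fail) and a simple mean-baseline advantage, which is only one common way to turn rewards into a learning signal; the function names are made up for illustration.

```python
def advantages(rewards):
    """Mean-baseline advantages for one group of sampled attempts.

    Assumption: the update for each attempt is scaled by (reward - group mean).
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def p_any_success(p, k):
    """Chance that at least one of k independent attempts passes the verifier,
    given a per-attempt success rate p."""
    return 1.0 - (1.0 - p) ** k


# Every attempt fails: all advantages are zero, so there is no update signal.
print(advantages([0, 0, 0, 0]))

# One attempt passes: it is pushed up, the failures are pushed down.
print(advantages([0, 0, 1, 0]))

# With a success rate of exactly zero, more sampling never helps;
# even a 1% success rate yields a usable sample fairly often at k = 64.
print(p_any_success(0.0, 64))
print(p_any_success(0.01, 64))
```

Under these assumptions, a base model with a success rate of exactly zero gives a zero update no matter how many attempts are sampled, while even a small nonzero rate gives RL something to amplify.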