Hacker News

xianshou · yesterday at 3:42 PM

Calling it now - RL finally "just works" for any domain where answers are easily verifiable. Verifiability was always a prerequisite, but the difference from prior generations (not just AlphaGo, but any nontrivial RL process prior to roughly mid-2024) is that the reasoning traces and/or intermediate steps can be open-ended with potentially infinite branching, no clear notion of "steps" or nodes and edges in the game tree, and a wide range of equally valid solutions. As long as the quality of the end result can be evaluated cleanly, LLM-based RL is good to go.

As a corollary, once you add in self-play with random variation, the synthetic data problem is solved for coding, math, and some classes of scientific reasoning. No more mode collapse, no more massive teams of PhDs needed for human labeling, as long as you have a reliable metric for answer quality.

This isn't just neat, it's important - as we run out of useful human-generated data, RL scaling is the best candidate to take over where pretraining left off.
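A minimal sketch of the "easily verifiable" loop described above, with toy lambdas standing in for LLM samples and a made-up `verify` helper: the verifier scores each candidate against held-out tests, and rejection sampling keeps only the passers as synthetic training data.

```python
# Hedged sketch of a verifiable reward plus rejection sampling.
# The candidate functions below are stand-ins for sampled LLM outputs.

def verify(candidate_fn, tests):
    """Return 1.0 if the candidate passes every test case, else 0.0."""
    try:
        return float(all(candidate_fn(x) == y for x, y in tests))
    except Exception:
        return 0.0  # crashing counts as failure

# Three candidate implementations of abs(), as an LLM might produce.
candidates = [
    lambda x: x,                    # wrong: fails on negatives
    lambda x: -x if x < 0 else x,   # correct
    lambda x: x * x,                # wrong: squares instead
]
tests = [(-3, 3), (0, 0), (5, 5)]

# Rejection sampling: keep only what the verifier accepts.
accepted = [c for c in candidates if verify(c, tests) == 1.0]
```

The point is that nothing here needs a human label: as long as `verify` is reliable, accepted samples can be fed back as training data.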


Replies

resiros · yesterday at 4:36 PM

I skimmed the paper quickly. This does not look like RL - it's a genetic algorithm. In a previous life I worked in compbio (protein structure prediction), where we built hundreds of such heuristic-based algorithms (Monte Carlo simulated annealing, GAs, ...). The moment you have a good energy function (one that provides some sort of gradient) and a fast enough sampling function (LLMs), you can do lots of cool optimization with sufficient compute.

I guess that's now becoming true with LLMs.

Faster LLMs -> More intelligence
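The "energy function + fast sampler" recipe mentioned above, as a minimal Monte Carlo simulated annealing sketch on a toy 1-D energy landscape (the quadratic here is an assumption for illustration; the real systems score protein conformations or programs):

```python
import math
import random

def energy(x):
    # Toy energy function with its minimum at x = 3.
    return (x - 3.0) ** 2

def anneal(x, steps=5000, temp=2.0, cooling=0.999, seed=0):
    rng = random.Random(seed)
    e = energy(x)
    for _ in range(steps):
        cand = x + rng.gauss(0, 0.5)  # cheap sampler of nearby candidates
        de = energy(cand) - e
        # Metropolis criterion: always accept improvements, sometimes
        # accept worse moves so the search can escape local minima.
        if de < 0 or rng.random() < math.exp(-de / temp):
            x, e = cand, energy(cand)
        temp *= cooling  # cool down: accept fewer bad moves over time
    return x

best = anneal(x=-10.0)
```

Swap the Gaussian proposal for an LLM call and the quadratic for a task-quality score and you get the shape of the systems the comment describes.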

yorwba · yesterday at 4:42 PM

You also need a base model that can satisfy the verifier at least some of the time. If all attempts fail, there's nothing there to reinforce. The reinforcement-learning algorithms themselves haven't changed much, but LLMs got good enough on many problems that RL could be applied. So for any given class of problem you still need enough human data to get initial performance better than random.
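The "all attempts fail, nothing to reinforce" point can be made concrete with a group-relative advantage (reward minus the group mean, GRPO-style - my choice of formulation, not something from the thread): a batch of uniform failures yields zero advantage everywhere, hence zero learning signal.

```python
def advantages(rewards):
    """Center rewards within a group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Base model never satisfies the verifier: every advantage is zero,
# so there is no gradient direction to reinforce.
all_fail = advantages([0.0, 0.0, 0.0, 0.0])

# One success out of four: nonzero advantages appear, and training
# can push probability mass toward the successful sample.
mixed = advantages([0.0, 1.0, 0.0, 0.0])
```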

skybrian · yesterday at 4:19 PM

There's no API or product yet, so it seems unlikely that they made it to a "just works" level of polish?

They are having some success in making it work internally. Maybe only the team that built it can get it to work? But it does seem promising.

modeless · yesterday at 4:24 PM

IMO RL can only solve "easy" problems. The reason RL works now is that unsupervised learning is a general recipe for transforming hard problems into easy ones. But it can't go all the way to solutions, you need RL on top for that. Yann LeCun's "cherry on top" analogy was right.

smattiso · yesterday at 3:58 PM

Are there platforms that make such training more streamlined? Say I have some definition of success for a given problem and its data - how do I go about producing such an RL model as quickly and easily as possible?

unignorant · yesterday at 5:20 PM

This technique doesn't actually use RL at all! There’s no policy-gradient training, value function, or self-play RL loop like in AlphaZero/AlphaTensor/AlphaDev.

As far as I can tell, the weights of the LLM are not modified. They do some kind of candidate selection via evolutionary algorithms over the LLM prompt, which the LLM then remixes. This process then iterates like a typical evolutionary algorithm.
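A sketch of the loop described above, with a random point mutation standing in for the LLM "remix" step (the real operator, target string, and scoring function here are all illustrative stand-ins): a population of candidates is scored, the best are selected, and variants are proposed - no model weights are ever updated.

```python
import random

TARGET = "hello world"
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def score(candidate):
    # Evaluation function: count positions matching the target.
    return sum(a == b for a, b in zip(candidate, TARGET))

def remix(candidate, rng):
    # Stand-in for the LLM remix step: mutate one random position.
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]

def evolve(generations=500, pop_size=20, seed=0):
    rng = random.Random(seed)
    pop = ["".join(rng.choice(ALPHABET) for _ in TARGET)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 4]  # selection: keep the top quarter
        pop = parents + [remix(rng.choice(parents), rng)
                         for _ in range(pop_size - len(parents))]
    return max(pop, key=score)

best = evolve()
```

Replace `remix` with an LLM prompted on the best candidates and `score` with an evaluation function, and this is the AlphaZero-free loop the comment is pointing at.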

4b11b4 · yesterday at 5:22 PM

This isn't quite RL, right...? It's an evolutionary approach on specifically labeled sections of code optimizing towards a set of metrics defined by evaluation functions written by a human.

I suppose you could consider that last part (optimizing some metric) "RL".

However, it's missing a key concept of RL which is the exploration/exploitation tradeoff.
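The exploration/exploitation tradeoff the comment invokes, in its most common textbook form: an epsilon-greedy agent on a toy two-armed bandit (the payout rates and epsilon value are illustrative). With epsilon = 0 the agent can lock onto the first arm that ever pays out and never discover the better one.

```python
import random

def run_bandit(epsilon, pulls=2000, seed=0):
    rng = random.Random(seed)
    payout = [0.3, 0.7]      # true win rate of each arm (unknown to agent)
    counts = [0, 0]
    values = [0.0, 0.0]      # running mean reward per arm
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(2)                        # explore
        else:
            arm = max(range(2), key=lambda a: values[a])  # exploit
        reward = 1.0 if rng.random() < payout[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update.
        values[arm] += (reward - values[arm]) / counts[arm]
    return values

values = run_bandit(epsilon=0.1)
```

After enough pulls, the estimated values track the true payout rates, and the agent spends most pulls on the better arm while still sampling the other.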

TechDebtDevin · yesterday at 3:54 PM

Most things are verifiable, just not with code. I'm not particularly excited for a world where everything is predictable. This is coming from a guy who loves forecasting/prediction modeling too, but one thing I hate about prediction modeling, especially as a hobbyist, is data: it's very hard to get useful data. Investors will literally buy into hospital groups to get medical data, for example.

There are monopolies on the coolest sets of data in almost all industries, all the RL in the world won't do us any good if those companies doing the data hoarding are only using it to forecast outcomes that will make them more money, not what can be done to better society.

spyckie2 · yesterday at 5:35 PM

I think you mean the general class of algorithms that scale with compute, RL being the chief example. But yes, I agree with that point.

obsolete_wagie · yesterday at 4:52 PM

Yup. It's coming. Any verifiable human skill will be done by AI.