Reinforcement learning from self-play/AlphaWhatever? Nah, must just be datasets. :)

ctoth • yesterday at 5:14 PM • 4 replies

Replies

NitpickLawyer • yesterday at 5:21 PM

And architecture stuff like actually useful long context. Whatever they did with Gemini 2.5 is miles ahead of the previous models in useful long-context results. I'd be very surprised if Gemini 2.5 is "just" Gemini 1 w/ better data.

grumpopotamus • yesterday at 5:17 PM

https://en.wikipedia.org/wiki/TD-Gammon

energy123 • today at 1:36 AM

Self-play gives you a large explosion of data.

nyrikki • yesterday at 6:03 PM

Big difference between a perfect information, completely specified zero sum game and the real world.

As a simple analogy, read out the following sentence multiple times, stressing a different word each time.

"I never said she stole my money"

Note how the meaning changes and is often unique?
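
To make the analogy concrete, here is a minimal Python sketch that lists one variant of the sentence per stressed word. The function name `stress_variants` and the asterisk marking are illustrative choices only, not anything from a library. The same seven tokens give seven readings, and the plain text alone does not tell you which one was meant.

```python
SENTENCE = "I never said she stole my money"


def stress_variants(sentence: str) -> list[str]:
    """Return one copy of the sentence per word, with that word wrapped in *...* to mark stress."""
    words = sentence.split()
    variants = []
    for i in range(len(words)):
        # Mark only the i-th word as stressed; leave the others untouched.
        marked = [f"*{w}*" if j == i else w for j, w in enumerate(words)]
        variants.append(" ".join(marked))
    return variants


if __name__ == "__main__":
    for line in stress_variants(SENTENCE):
        print(line)
```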

That is a lens into the frame problem and its inverse, the specification problem.

The above problem quickly becomes TOWER-complete, and recent studies suggest that RL is reinforcing, or increasing the weight of, existing patterns.

As the open domain frame problem and similar challenges are equivalent to HALT, finding new ways to extract useful information will be important for generalization IMHO.

Synthetic data is useful, but not a complete solution, especially for TOWER problems.
