
csomar today at 10:31 AM

> If anything, they predict words based on a heuristic ensemble of what word is most likely to come next in similar sentences and what word is most likely to give a final higher reward.

So... "finding the most likely next word based on what they've seen on the internet"?


Replies

andy12_ today at 4:47 PM

Reinforcement learning is not done with random data found on the internet; it's done with curated, high-quality labeled datasets. There have been approaches that apply reinforcement learning to pre-training[1], learning a predict-the-next-sentence objective in an unsupervised way, but as far as I know they don't scale.

[1] https://arxiv.org/pdf/2509.19249
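
To make the distinction a bit more concrete, here is a toy sketch of how a reward signal can move a model away from "the most likely next word". Everything in it is made up for illustration: the three-word vocabulary, the starting logits standing in for a pre-trained model, the reward table, and the learning rate. It applies the exact expected REINFORCE gradient to a single-token policy, which is far simpler than what is actually used to train LLMs.

```python
import math

def softmax(logits):
    # Convert raw scores into a probability distribution over tokens.
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def reinforce_step(logits, reward, lr=1.0):
    # One gradient-ascent step on expected reward J = sum_t p_t * r_t.
    # For a softmax policy, dJ/dlogit_t = p_t * (r_t - J).
    probs = softmax(logits)
    baseline = sum(probs[t] * reward[t] for t in probs)
    return {t: logits[t] + lr * probs[t] * (reward[t] - baseline) for t in logits}

# Hypothetical "pre-trained" preferences: "sure" is the most likely next word.
pretrained_logits = {"sure": 2.0, "maybe": 1.0, "no": 0.0}
# Hypothetical curated reward: only "maybe" is labeled as the good answer.
reward = {"sure": 0.0, "maybe": 1.0, "no": 0.0}

before = softmax(pretrained_logits)
logits = dict(pretrained_logits)
for _ in range(50):
    logits = reinforce_step(logits, reward)
after = softmax(logits)

print("before:", {t: round(p, 3) for t, p in before.items()})
print("after: ", {t: round(p, 3) for t, p in after.items()})
```

The starting distribution reflects only how often each word followed similar text; the updates come only from the reward table, which is the part that would be built from curated labels rather than scraped data.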