I think the "aha" is that the RL training caused it to adopt an anthropomorphic tone.
One difference from the initial step is that the second pass includes the initial step and the "aha" comment in its context: it is, after all, just doing token-wise LLM prediction.
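To make that concrete, here is a minimal sketch (all names hypothetical, with a stand-in for the actual model) of how autoregressive generation accumulates context, so the second pass conditions on the first step and the "aha" remark alike:

```python
def predict_next(context: str) -> str:
    """Stand-in for LLM token-wise prediction; a real model would
    generate tokens conditioned on the full context string."""
    return f"<continuation of {len(context)} chars>"

# First pass: the model sees only the prompt.
prompt = "Solve: sqrt(a) = a - 2"
step1 = predict_next(prompt)

# Suppose the model emits an "aha" remark mid-generation.
aha = "Wait, wait. That's an aha moment."

# Second pass: the context now contains the prompt, the first step,
# AND the aha comment -- the model conditions on all of it.
context2 = prompt + step1 + aha
step2 = predict_next(context2)

print(len(context2) > len(prompt))  # the context only ever grows
```

Nothing in the mechanism distinguishes the model's own prior output from the user's prompt; both are just tokens in the context window.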
OTOH, the RL process means it has potentially learned how the statements it emits affect the success of its future generation. That self-direction takes it somewhat beyond vanilla-LLM pattern mimicry, IMHO.