logoalt Hacker News

falseAss01/21/20251 replyview on HN

from their Table-3 "the aha moment", can someone explain why the re-evaluation step worth to "aha"? It looks simply repeating the initial step in the exact same way?


Replies

mdda01/21/2025

I think the "Aha" is that the RL caused it to use an anthropomorphic tone.

One difference from the initial step is that the second time around includes the initial step and the aha comment in the context : It is, after all, just doing LLM token-wise prediction.

OTOH, the RL process means that it has potentially learned the impact of statements that it makes on the success of future generation. This self-direction makes it go somewhat beyond vanilla-LLM pattern mimicry IMHO.