unignorant · yesterday at 5:20 PM

This technique doesn't actually use RL at all! There’s no policy-gradient training, value function, or self-play RL loop like in AlphaZero/AlphaTensor/AlphaDev.

As far as I can tell, the weights of the LLM are never modified. Instead, an evolutionary algorithm maintains a population of candidate solutions, selects the most promising ones into the LLM's prompt, and has the LLM remix them into new candidates. The process then iterates like a typical evolutionary algorithm.
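
To make that concrete, here's a toy sketch of the loop as I read it (my own illustration, not the authors' code): keep a population of candidates, embed the best ones in a prompt, have the LLM "remix" them into new candidates, score those, and repeat. `call_llm`, `score`, `build_prompt`, and the string-matching objective are all placeholders I made up; the fake `call_llm` just mutates a parent so the script actually runs.

    # Sketch of an LLM-guided evolutionary loop. No gradients, no weight
    # updates: the only "learning" is selection over candidates in the prompt.
    import random
    import string

    TARGET = "hello world"  # toy objective: evolve a string toward TARGET

    def score(candidate: str) -> int:
        """Fitness: number of positions matching TARGET (higher is better)."""
        return sum(a == b for a, b in zip(candidate, TARGET))

    def build_prompt(parents: list[str]) -> str:
        """In-context 'remixing': show the best candidates and ask for a new one."""
        shown = "\n".join(f"- {p} (score={score(p)})" for p in parents)
        return f"Here are promising candidates:\n{shown}\nPropose an improved candidate."

    def call_llm(prompt: str, parents: list[str]) -> str:
        """Hypothetical LLM call. A real system would send `prompt` to a model
        and parse a new candidate from the reply; here we fake it by copying a
        parent and mutating one character so the example is self-contained."""
        parent = random.choice(parents)
        chars = list(parent.ljust(len(TARGET)))
        chars[random.randrange(len(TARGET))] = random.choice(string.ascii_lowercase + " ")
        return "".join(chars)[: len(TARGET)]

    def evolve(pop_size: int = 20, generations: int = 300) -> str:
        population = [
            "".join(random.choices(string.ascii_lowercase + " ", k=len(TARGET)))
            for _ in range(pop_size)
        ]
        for _ in range(generations):
            # Selection: keep the top half of the population by fitness.
            population.sort(key=score, reverse=True)
            parents = population[: pop_size // 2]
            # Generation: the "LLM" remixes the parents (via the prompt) into children.
            children = [call_llm(build_prompt(parents), parents)
                        for _ in range(pop_size - len(parents))]
            population = parents + children
        return max(population, key=score)

    if __name__ == "__main__":
        best = evolve()
        print(best, score(best))

The point of the sketch is that the model is a frozen proposal engine: all the optimization pressure comes from scoring and selecting what goes back into the prompt, not from any RL-style update to the model itself.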