Skimmed the paper quickly. This does not look like RL. It's a genetic algorithm. In a previous life I worked in compbio (protein structure prediction), and we built hundreds of such heuristic-based algorithms (Monte Carlo simulated annealing, GAs, ...). The moment you have a good energy function (one that provides some sort of gradient) and a fast enough sampling function (LLMs), you can do lots of cool optimization with sufficient compute.
I guess that's now becoming true with LLMs.
Faster LLMs -> More intelligence
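To make the "energy function plus fast sampler" recipe concrete, here is a rough sketch of the kind of loop I mean. Everything in it is a toy assumption: `energy` just counts zero bits, and `propose` flips one random bit where a real system would call an LLM. It's a mutation-only evolutionary loop (no crossover), not whatever the paper actually does.

```python
import random

def energy(candidate):
    """Toy energy (assumption): number of zero bits, lower is better."""
    return candidate.count(0)

def propose(parent, rng):
    """Stand-in for the sampler (an LLM in the setting above): flip one random bit."""
    child = list(parent)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child

def evolve(n_bits=20, pop_size=16, n_keep=4, generations=100, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []  # best energy seen at the start of each generation
    for _ in range(generations):
        # Selection: rank by energy and keep the lowest-energy candidates.
        population.sort(key=energy)
        history.append(energy(population[0]))
        survivors = population[:n_keep]
        # Sampling: refill the population with mutated copies of survivors.
        children = [propose(rng.choice(survivors), rng)
                    for _ in range(pop_size - n_keep)]
        population = survivors + children
    best = min(population, key=energy)
    history.append(energy(best))
    return best, history

if __name__ == "__main__":
    best, history = evolve()
    print("best energy:", history[-1])
```

Because the survivors are carried over unchanged, the best energy can never get worse from one generation to the next. The only expensive parts are `energy` and `propose`, which is why a faster sampler translates directly into more search per unit of compute.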
Genetic algorithms are worse than gradient descent.
If variety is what you're after, why not beam search with good population statistics?
> This does not look like RL. It's a genetic algorithm.
Couldn't you say that, if you squint hard enough, GA looks like a category of RL? There are certainly a lot of similarities, the main difference being how each new population of solutions is generated. I wouldn't be at all surprised if they're using a GA/RL hybrid.
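A toy sketch of that difference, under my own assumptions (bitstring candidates, a made-up `reward` that counts one bits, REINFORCE with a mean baseline as the RL stand-in). `ga_step` builds the next population by mutating the best individuals directly; `policy_step` samples the population from a parameterized distribution and then updates the parameters. None of this is taken from the paper.

```python
import math
import random

def reward(bits):
    """Toy reward (assumption): number of one bits, higher is better."""
    return sum(bits)

def ga_step(population, rng, n_keep=4):
    """GA-style: next population = best individuals plus mutated copies of them."""
    survivors = sorted(population, key=reward, reverse=True)[:n_keep]
    children = []
    while len(survivors) + len(children) < len(population):
        child = list(rng.choice(survivors))
        i = rng.randrange(len(child))
        child[i] = 1 - child[i]  # single bit-flip mutation
        children.append(child)
    return survivors + children

def policy_step(logits, rng, pop_size=32, lr=0.5):
    """RL-style (REINFORCE): sample a population from the policy, then move
    the policy parameters toward the samples that scored above average."""
    probs = [1.0 / (1.0 + math.exp(-t)) for t in logits]
    samples = [[1 if rng.random() < p else 0 for p in probs]
               for _ in range(pop_size)]
    rewards = [reward(s) for s in samples]
    baseline = sum(rewards) / pop_size  # mean-reward baseline
    new_logits = []
    for i, (t, p) in enumerate(zip(logits, probs)):
        # d/dt log Bernoulli(x; sigmoid(t)) = x - p
        grad = sum((r - baseline) * (s[i] - p)
                   for s, r in zip(samples, rewards)) / pop_size
        new_logits.append(t + lr * grad)
    return new_logits

if __name__ == "__main__":
    rng = random.Random(0)
    pop = [[rng.randint(0, 1) for _ in range(10)] for _ in range(16)]
    for _ in range(20):
        pop = ga_step(pop, rng)
    logits = [0.0] * 10
    for _ in range(100):
        logits = policy_step(logits, rng)
    print(max(map(reward, pop)), [round(t, 2) for t in logits])
```

In the GA version the "state" is the population itself; in the policy version the state is `logits` and the population is thrown away after every update. A hybrid would sit somewhere in between, for example keeping an elite set while also updating the sampler.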