Oh man, that's funny to see one of my grad school class projects in that list. Takes me back. :-)
From that experience: The LLM is likely to do drastically better. Most of the prior work, mine included, took a genetic algorithm approach, but an LLM is more likely to make coherent multi-instruction modifications.
It's a shame they didn't compare against some of the standard Core War benchmarks as a way to facilitate comparisons to prior work, though. Makes it hard to say for sure that they're better. https://corewar.co.uk/bench.htm
I'm not sure that will hold up. The LLM is not going to do anything random, and randomness is actually a powerful component that makes original output possible.