
_alternator_ · today at 6:52 PM

This comment will probably get buried because I’m late to the party, but I’d like to point out that while the author identifies a real problem, their approach—using code or ASTs to validate LLM output—does not solve it.

Yes, the approach can certainly detect (some) LLM errors, but it does not provide a feasible method for generating responses that don’t contain those errors in the first place. You can see at the end that the proposed solution is to automatically update the prompt with a new rule, which is precisely the kind of “vibe check” that LLMs frequently ignore. If they didn’t, you could just write a prompt that says “don’t make any mistakes” and be done with it.
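For concreteness, here’s a minimal sketch of what AST-based validation buys you, assuming the generated code is Python (the article may target another language). It can tell you the output is broken; it has no mechanism for producing a corrected version:

```python
import ast

def validate_llm_code(source: str) -> list[str]:
    """Return a list of problems found in LLM-generated Python source.

    Parsing catches syntax errors; walking the AST can catch a few
    structural rules (e.g. bare `except:`). It cannot repair anything.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    problems = []
    for node in ast.walk(tree):
        # Example structural rule: flag bare `except:` clauses.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            problems.append(f"bare except at line {node.lineno}")
    return problems


# Detection works fine...
print(validate_llm_code("def f(:\n    pass"))
# ...but nothing here tells the model how to generate valid code next time.
```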

You can certainly use this approach to do some RL on LLM code output, but it’s not going to guarantee correctness. The core problem is that LLMs do next-token prediction, and it’s extremely challenging to enforce a global constraint like “generate valid code” a priori.
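To be fair, you can wire such a validator into an RL-style loop as a reward signal. A rough sketch, where `policy`, `sample_fn`, and `update_fn` are placeholders I made up rather than anything from the article:

```python
import ast

def reward(source: str) -> float:
    """Binary reward: 1.0 if the candidate parses as valid Python, else 0.0."""
    try:
        ast.parse(source)
        return 1.0
    except SyntaxError:
        return 0.0

def rl_step(policy, prompt, sample_fn, update_fn, batch_size=8):
    """One hypothetical policy-update step using parse success as the reward.

    `sample_fn(policy, prompt)` generates a candidate and
    `update_fn(policy, samples, rewards)` applies whatever training
    algorithm you actually use; both are stand-ins.
    """
    samples = [sample_fn(policy, prompt) for _ in range(batch_size)]
    rewards = [reward(s) for s in samples]
    update_fn(policy, samples, rewards)
    # The reward shifts probability mass toward parseable outputs, but
    # sampling is still unconstrained next-token prediction: nothing
    # forces the next batch to parse.
```

The reward nudges the output distribution; it never becomes a hard constraint on generation.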

As a closing comment: I seem to be seeing a lot of half-baked technical work related to LLMs these days, because LLMs are good at encouraging people’s half-baked ideas and reluctant to openly point out the obvious flaws.