
vicchenai, today at 4:08 AM

the rl loop here is clever but i wonder how the reward signal degrades over time. if you're optimizing for user acceptance of suggestions, you're inevitably training on a mix of "this was actually correct" and "i accepted because editing the suggestion was more work than accepting it." that second case creates a subtle bias toward suggestions that are close-enough-to-not-bother-fixing rather than actually correct.
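to make that concrete, here's a toy sketch of the gap. everything in it is an assumption on my part, not how their system works: the `Suggestion` fields, the `fix_effort` scale, and the `tolerance` threshold are all made up, and the user model (accept if correct, or if fixing it is cheap enough to not bother) is just the simplest thing that shows the effect.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    is_correct: bool
    # hypothetical scale: 0.0 = nothing to fix, 1.0 = needs a full rewrite
    fix_effort: float


def accepted(s: Suggestion, tolerance: float = 0.2) -> bool:
    # assumed user model: accept anything correct, plus anything
    # close enough that rejecting or editing it isn't worth the effort
    return s.is_correct or s.fix_effort <= tolerance


def acceptance_reward(batch: list[Suggestion]) -> float:
    # what an acceptance-based reward would see
    return sum(accepted(s) for s in batch) / len(batch)


def true_accuracy(batch: list[Suggestion]) -> float:
    # what you actually wanted to optimize
    return sum(s.is_correct for s in batch) / len(batch)


batch = [
    Suggestion(is_correct=True, fix_effort=0.0),
    Suggestion(is_correct=True, fix_effort=0.0),
    Suggestion(is_correct=False, fix_effort=0.10),  # near miss, accepted anyway
    Suggestion(is_correct=False, fix_effort=0.15),  # near miss, accepted anyway
    Suggestion(is_correct=False, fix_effort=0.80),  # obviously wrong, rejected
]

print(acceptance_reward(batch), true_accuracy(batch))
```

in this made-up batch the acceptance signal is double the real accuracy, and the policy gets the same reward for a near miss as for a correct suggestion, which is the bias i mean.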

also curious whether they see different convergence patterns across languages. my gut says something like python, where there's more stylistic variation, would be harder to get a clean reward signal from than something like rust, where there are fewer idiomatic ways to do things.