
simonw · last Monday at 10:41 PM

The idea that LLMs were trained on miscellaneous scraped, low-quality code may have been true a year ago, but I suspect it is no longer true today.

All of the major model vendors are competing on how well their models can code. The key to getting better code out of the model is improving the quality of the code that it is trained on.

Filtering training data for high quality code is easier than filtering for high quality data of other types: code can be checked mechanically, by parsing it, running linters and type checkers, or executing its tests.
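
To illustrate what I mean (my own sketch, not anything a vendor has documented): even the crudest automated gate, "does the sample parse at all", is trivial to run at scale. Real pipelines presumably layer linters, type checkers, and test execution on top.

    import ast

    def parses_as_python(source: str) -> bool:
        # Cheapest possible quality gate: keep only samples
        # that are syntactically valid Python.
        try:
            ast.parse(source)
            return True
        except SyntaxError:
            return False

    samples = [
        "def add(a, b):\n    return a + b\n",  # valid
        "def add(a, b)\n    return a + b\n",   # missing colon, rejected
    ]
    kept = [s for s in samples if parses_as_python(s)]
    print(f"{len(kept)} of {len(samples)} samples pass the syntax gate")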

My strong hunch is that the quality of code being used to train current frontier models is way higher than it was a year ago.