I wonder about the practicalities of improving this. Say you have "acquired" all of the public internet code. Focus on just Python and JavaScript. There are solid linters for these languages - automatically flag any code with a trivial SQL injection and exclude it from a future training set. Does this lead to a marked improvement in code quality? Or is the naive string-concatenation approach so obvious and simple that an LLM will still produce such vulnerabilities even without obvious training material (inferred from blogs or other languages)?
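The filter itself seems cheap to prototype. A minimal sketch, assuming bandit is installed and that its B608 "hardcoded SQL expressions" check and JSON output behave as documented; the "corpus" directory is made up for illustration:

    # Sketch: drop files that bandit flags for string-built SQL (B608).
    import json
    import pathlib
    import subprocess

    def has_sql_injection(path):
        # bandit -f json emits a "results" list with a test_id per finding
        out = subprocess.run(
            ["bandit", "-f", "json", str(path)],
            capture_output=True, text=True,
        )
        try:
            results = json.loads(out.stdout).get("results", [])
        except json.JSONDecodeError:
            return True  # unparseable output: exclude the file to be safe
        return any(r.get("test_id") == "B608" for r in results)

    kept = [p for p in pathlib.Path("corpus").rglob("*.py")
            if not has_sql_injection(p)]

The interesting question is exactly the one above: whether removing the flagged examples actually removes the pattern from the model, or whether string concatenation is such a low-entropy default that it reappears anyway.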
You could even take it a step further. Run a linting check on all of the source - code with a defect rate above X% gets excluded from training. Raise the floor of code quality by tossing some of the dross. That probably leads to a hilarious reduction in corpus size.
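The density version is just as easy to sketch. Assuming flake8 (which prints one line per finding), with MAX_DEFECT_RATE as an arbitrary knob rather than a recommendation:

    # Sketch: exclude files whose lint-message density exceeds a threshold.
    import pathlib
    import subprocess

    MAX_DEFECT_RATE = 0.05  # hypothetical: 5 findings per 100 lines

    def defect_rate(path):
        # flake8 prints one line per finding, so counting lines suffices
        out = subprocess.run(
            ["flake8", str(path)], capture_output=True, text=True,
        )
        findings = len(out.stdout.splitlines())
        loc = max(1, len(path.read_text(errors="ignore").splitlines()))
        return findings / loc

    kept = [p for p in pathlib.Path("corpus").rglob("*.py")
            if defect_rate(p) <= MAX_DEFECT_RATE]

Measuring how much of the corpus survives at various thresholds would itself be a useful experiment before committing to any cutoff.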