I wonder about the practicalities of improving this. Say you have "acquired" all of the public internet code. Focus on just Python and JavaScript. There are solid linters for these languages - automatically flag any code with a trivial SQL injection and exclude it from a future training set. Does this lead to a marked improvement in code quality? Or is the naive string-concatenation approach so obvious and simple that an LLM will still produce such vulnerabilities even without obvious training material (inferred from blogs or other languages)?
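The filter itself seems cheap to prototype. A minimal sketch, assuming bandit is installed and that its B608 "hardcoded SQL expressions" check and JSON output behave as documented; the "corpus" directory is made up for illustration:

    # Sketch: drop files that bandit flags for string-built SQL (B608).
    import json
    import pathlib
    import subprocess

    def has_sql_injection(path):
        # bandit -f json emits a "results" list with a test_id per finding
        out = subprocess.run(
            ["bandit", "-f", "json", str(path)],
            capture_output=True, text=True,
        )
        try:
            results = json.loads(out.stdout).get("results", [])
        except json.JSONDecodeError:
            return True  # unparseable output: exclude the file to be safe
        return any(r.get("test_id") == "B608" for r in results)

    kept = [p for p in pathlib.Path("corpus").rglob("*.py")
            if not has_sql_injection(p)]

The interesting question is exactly the one above: whether removing the flagged examples actually removes the pattern from the model, or whether string concatenation is such a low-entropy default that it reappears anyway.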
You could even take it a step further. Run a linting check on all of the source - code with a defect rate above X% gets excluded from training. Raise the floor of code quality by tossing some of the dross. That probably leads to a hilarious reduction in corpus size.
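The density version is just as easy to sketch. Assuming flake8 (which prints one line per finding), with MAX_DEFECT_RATE as an arbitrary knob rather than a recommendation:

    # Sketch: exclude files whose lint-message density exceeds a threshold.
    import pathlib
    import subprocess

    MAX_DEFECT_RATE = 0.05  # hypothetical: 5 findings per 100 lines

    def defect_rate(path):
        # flake8 prints one line per finding, so counting lines suffices
        out = subprocess.run(
            ["flake8", str(path)], capture_output=True, text=True,
        )
        findings = len(out.stdout.splitlines())
        loc = max(1, len(path.read_text(errors="ignore").splitlines()))
        return findings / loc

    kept = [p for p in pathlib.Path("corpus").rglob("*.py")
            if defect_rate(p) <= MAX_DEFECT_RATE]

Measuring how much of the corpus survives at various thresholds would itself be a useful experiment before committing to any cutoff.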