But are those tests relevant? I've tried using LLMs to write tests at work, and whenever I review them I end up asking, “OK, great, it passes the test, but is the test relevant? Does it test anything useful?” And I get back, “Oh yeah, you’re right, this test is pointless.”
We fixed this at work by instructing it to maximize coverage with minimal tests, which is closer to our coding style.
Those tests were written by people. That's why they were confident that what the LLM implemented was correct.
Yes
Skill issue... And perhaps the wrong model + harness
Keep track of test coverage and ask it to delete tests without lowering coverage by more than, say, 0.01 percentage points. If you have a script that gives it only the coverage number, plus a file listing all tests with their line-number ranges, it's more or less a dumb task it can grind on for hours without actually reading the source files (which would fill the context too quickly).
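A minimal sketch of that loop, assuming you can get per-test coverage as a set of covered line numbers (e.g. from a coverage report); the `Test`, `coverage_pct`, and `prune_tests` names are illustrative, not from any real harness:

```python
# Hypothetical greedy pruner: delete tests as long as total coverage
# stays within max_drop_pp percentage points of the original baseline.
from dataclasses import dataclass

@dataclass(frozen=True)
class Test:
    name: str
    covered: frozenset  # line numbers this test covers

def coverage_pct(tests, total_lines):
    # Union of all covered lines, as a percentage of total lines.
    covered = set().union(*(t.covered for t in tests)) if tests else set()
    return 100.0 * len(covered) / total_lines

def prune_tests(tests, total_lines, max_drop_pp=0.01):
    # Try smallest tests first; drop one only if coverage stays
    # within max_drop_pp of the pre-pruning baseline.
    kept = list(tests)
    baseline = coverage_pct(kept, total_lines)
    for t in sorted(tests, key=lambda t: len(t.covered)):
        candidate = [k for k in kept if k is not t]
        if baseline - coverage_pct(candidate, total_lines) <= max_drop_pp:
            kept = candidate
    return kept
```

With two tests covering the same 50 lines and a third covering 10 more, the pruner deletes one duplicate and keeps the rest; the point is that the model (or a plain script) only ever sees coverage numbers and test names, never the source.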