Hacker News

colechristensen, yesterday at 10:08 PM

It's a struggle to get LLMs to generate tests that aren't entirely stupid.

Like grepping source code for a string, or assert(1 == 1, true).

You have to have a curated list of every kind of test not to write, or you get hundreds of pointless-at-best tests.
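To illustrate the pattern being complained about (the function and tests below are invented for this sketch, not from any real codebase), here's the difference between a tautological test and one with actual content:

```python
# Hypothetical example: a small function and two styles of test for it.

def slugify(title: str) -> str:
    """Lowercase a title and join its words with hyphens."""
    return "-".join(title.lower().split())

# Pointless-at-best: passes no matter what slugify actually does.
def test_slugify_tautology():
    assert slugify("Hello World") == slugify("Hello World")

# A test with content: pins down concrete behavior and an edge case.
def test_slugify_behavior():
    assert slugify("Hello World") == "hello-world"
    assert slugify("  extra   spaces ") == "extra-spaces"

test_slugify_tautology()
test_slugify_behavior()
```

The tautology would keep passing even if slugify returned its input unchanged, which is exactly why such tests are worse than no tests: they inflate coverage numbers without constraining behavior.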


Replies

btrettel, yesterday at 10:55 PM

What I've observed in computational fluid dynamics is that LLMs seem to grab common validation cases used often in the literature, regardless of their relevance to the problem at hand. "Lid-driven cavity" cases were used by the two vibe-coded simulators I commented on at r/cfd, for instance. I never liked the lid-driven cavity problem because it rarely resembles an actual use case. A far better validation case would be an experiment on the same type of problem the user intends to solve. I think the lid-driven cavity problem is often picked in the literature because the geometry is easy to set up, not because it's relevant or particularly challenging. I don't know whether this comes from vibe coders not actually having a particular use case in mind or from LLMs overemphasizing what's common.

LLMs also seem to avoid checking the math of the simulator itself. In CFD, this is called verification. The comparisons are almost exclusively against experiments (validation), but it's possible for a model to be implemented incorrectly and for calibration of the model to hide that fact. It's common to check the order of accuracy of the numerical scheme to test whether it was implemented correctly, but I haven't seen any vibe coders do that. (LLMs definitely know about that procedure, as I've asked multiple LLMs about it before. It's not an obscure procedure.)
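The order-of-accuracy check described above can be sketched with a toy problem (the ODE and step sizes here are invented for illustration, not a real CFD case): run the same scheme at successively refined step sizes against a known exact solution and confirm the error shrinks at the scheme's formal order.

```python
import math

def euler_solve(h: float) -> float:
    """Forward Euler for dy/dt = -y, y(0) = 1, integrated to t = 1."""
    y = 1.0
    for _ in range(round(1.0 / h)):
        y += h * (-y)
    return y

exact = math.exp(-1.0)  # the true solution at t = 1
errors = [abs(euler_solve(h) - exact) for h in (0.02, 0.01, 0.005)]

# Observed order from successive halvings of h: log2(error ratio).
# Forward Euler is formally first-order, so this should approach 1.
orders = [math.log2(errors[i] / errors[i + 1]) for i in range(2)]
print(orders)
```

If a scheme that is supposed to be second-order only converges at first order in a test like this, something in the implementation is wrong, even if the results still look plausible next to an experiment after calibration.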

theshrike79, today at 7:27 AM

> You have to have a curated list of every kind of test not to write

This should be distilled into a tool: some kind of AST-based code analyser/linter that fails when it sees stupid test structures.

Just having it in plain English in a HOW-TO-TEST.md file is hit-or-miss.
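A minimal sketch of what such a linter could look like, using Python's standard `ast` module (the function name and the specific patterns flagged are invented here; a real tool would cover far more cases):

```python
import ast

def find_trivial_asserts(source: str) -> list[int]:
    """Return line numbers of assert statements whose condition is made
    entirely of constants, e.g. `assert 1 == 1` or `assert True`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assert):
            continue
        test = node.test
        if isinstance(test, ast.Constant):
            # e.g. `assert True` -- always passes (or always fails)
            hits.append(node.lineno)
        elif (isinstance(test, ast.Compare)
              and isinstance(test.left, ast.Constant)
              and all(isinstance(c, ast.Constant) for c in test.comparators)):
            # e.g. `assert 1 == 1` -- compares constants to constants
            hits.append(node.lineno)
    return hits

code = "assert 1 == 1\nassert True\nassert add(2, 2) == 4\n"
print(find_trivial_asserts(code))  # flags the two constant-only asserts
```

The same walk could be extended to flag other shapes, like a test comparing a call's result to itself, which is harder to express in a markdown instructions file but mechanical to check in an AST.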

gpm, yesterday at 10:42 PM

> have a curated list of every kind of test not to write

I've seen a lot of people interact with LLMs like this and I'm skeptical.

It's not how you'd "teach" a human (effectively). Teaching humans with positive examples is generally much more effective than teaching with negative ones. You'd show them examples of good tests to write, discuss the properties you want, and so on.

I try to interact with LLMs the same way. I certainly wouldn't say I've solved "how to interact with LLMs" but it seems to at least mostly work - though I haven't done any (pseudo-)scientific comparison testing or anything.

I'm curious if anyone else has opinions on what the best approach is here? Especially if backed up by actual data.
