flutas • yesterday at 11:21 PM • 2 replies

Working on reproducible test runs to catch quality issues from LLM providers.

My main goal is not just a "the model made code, yay!" setup, but verifiable outputs that can show degradation as percentages.

E.g. have the model make something like a Connect 4 engine, then run it through a lot of tests to see how "valid" its solution is. Then score that solution as NN% accurate. Then do many runs of the same test at a fixed interval.
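
A rough sketch of what that scoring step could look like, purely as an illustration and not the actual harness. Everything named here (`score`, `landing_row`, `CASES`, `degradation`) is hypothetical: `landing_row` stands in for a piece of model-generated Connect 4 code, and the cases are a tiny hand-written sample rather than a real test suite.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple

# Hypothetical case format: (arguments, expected result).
Case = Tuple[tuple, Any]


def score(candidate: Callable[..., Any], cases: Iterable[Case]) -> float:
    """Return the percentage of cases the candidate gets right (0 to 100)."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one test case")
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            # A crash counts as one failed case, not a failed run.
            pass
    return 100.0 * passed / len(cases)


# Stand-in for model-generated code (an assumption for this sketch):
# the row a piece lands in when dropped into a column.
# Row 0 is the top of the board; None means the column is full.
def landing_row(board: List[List[int]], col: int) -> Optional[int]:
    for row in range(len(board) - 1, -1, -1):
        if board[row][col] == 0:
            return row
    return None


EMPTY = [[0] * 7 for _ in range(6)]
FULL_COL = [[1] + [0] * 6 for _ in range(6)]  # column 0 is full

CASES: List[Case] = [
    ((EMPTY, 0), 5),
    ((EMPTY, 6), 5),
    ((FULL_COL, 0), None),
    ((FULL_COL, 1), 5),
]


def degradation(baseline: float, current: float) -> float:
    """Percentage-point drop from the baseline score (positive = worse)."""
    return baseline - current
```

Running `score` on the same cases at a fixed interval and feeding the results to something like `degradation` would give the "percentage worse than before" number. Counting a crash as a single failed case keeps one bad output from zeroing the whole run.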

I have ~10 tests like this so far, working on more.

Replies

alexgandy • today at 2:05 PM

Sounds really interesting. What are you using for the tests/reports?

sebastianconcpt • yesterday at 11:53 PM

Nice. Sounds like it will converge to QA as a Service.
