Hacker News

mikaelaast · yesterday at 12:12 PM

Sure. Verifiability is far-fetched. But say I want to produce a statistically significant evaluation result from this – essentially testing a piece of prose. How do I go about this, short of relying on a vague LLM-as-a-judge metric? What are the parameters?


Replies

visarga · yesterday at 5:39 PM

You 100% need to test work done by AI. If it's code, it needs to pass extensive tests; if it's just a question answered, it needs to be the common conclusion of multiple independent agents. You can trust a single AI about as much as an HN or Reddit comment, but you can trust a committee of four about as much as a real expert.

More generally, I think testing AI by using its web search, code execution, and ensembling is the missing ingredient for increased usage. We need to define the opposite of AI work: what validates it. This is hard, but once it's done you can trust the system, and it becomes cheaper to change.
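
A rough sketch of that committee idea in Python (ask_model is a made-up placeholder for whatever model API you actually call, and the 4-run / 0.75 quorum is just the "committee of four" from above):

    from collections import Counter

    def ask_model(question: str, run: int) -> str:
        # Placeholder for one independent model call (different seed,
        # temperature, or provider per run). Swap in your real API.
        raise NotImplementedError

    def committee_answer(question: str, n_runs: int = 4, quorum: float = 0.75):
        # Ask n_runs independent agents; accept the answer only if a quorum
        # of them agree after light normalization. Exact-match voting only
        # makes sense for short factual answers; longer prose would need a
        # similarity measure instead.
        answers = [ask_model(question, run=i).strip().lower() for i in range(n_runs)]
        top, votes = Counter(answers).most_common(1)[0]
        return top if votes / n_runs >= quorum else None  # None = no consensus, escalate

With four runs and a 0.75 quorum, three agents have to converge on the same answer before you treat it as trustworthy; anything short of that gets escalated to a human.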

JamesSwift · yesterday at 9:17 PM

How would you evaluate it if the agent were not a fuzzy logic machine?

The issue isn't the LLM; it's that verification is actually the hard part. In any case, this is typically called “evals”, and you can probably craft a test harness to evaluate these if you think about it hard enough.
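
For the prose question upthread, a minimal sketch of that kind of harness (the checks below are invented examples, not a real rubric; the point is that each one is deterministic, so the pass rate over N generations is something you can run an ordinary significance test on):

    import math

    # Deterministic checks standing in for "did the prose do its job".
    # These particular rules are invented examples; write ones that fit your task.
    CHECKS = [
        ("mentions_product", lambda t: "acme widget" in t.lower()),
        ("within_length",    lambda t: 50 <= len(t.split()) <= 200),
        ("no_banned_phrase", lambda t: "as an ai" not in t.lower()),
    ]

    def passes(text: str) -> bool:
        return all(check(text) for _, check in CHECKS)

    def compare(samples_a, samples_b) -> float:
        # Two-proportion z-test on the pass rates of two prompt/model variants.
        # Returns a two-sided p-value; a small value means the difference in
        # pass rate is unlikely to be noise.
        x_a, x_b = sum(map(passes, samples_a)), sum(map(passes, samples_b))
        n_a, n_b = len(samples_a), len(samples_b)
        pooled = (x_a + x_b) / (n_a + n_b)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        if se == 0:
            return 1.0  # identical pass rates at 0% or 100%: no evidence of a difference
        z = (x_a / n_a - x_b / n_b) / se
        return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

That gives you a p-value for "variant A passes the rubric more often than variant B" with no LLM-as-a-judge in the loop; the judgment is pushed into the checks, which is where the real design work is.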

coldtea · yesterday at 1:04 PM

Would a structured skills file format help you evaluate the results more easily?
