Hacker News

trjordan, yesterday at 8:03 PM

The core of the problem is that there are a million tools that make AI better, and no way to measure whether AI is working better.

Big companies with popular products have it. They do something between normal product analytics and chatbot evals to figure out if users are being successful in their sessions. That's the job.

But any given dev, with between 3 and 50 sessions a day? Like, I have no idea what makes the LLM better. It's all vibes.

My company has a whole stack here. Preferred harnesses, preferred models, skills, the shape of our code, everything. There's gotta be a way to measure whether this setup is working for us, at one-millionth the scale of a Claude Code.


Replies

jahala, yesterday at 9:37 PM

There is an answer: these tools should benchmark by cost per correct answer, not just tokens saved.
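
A rough sketch of what "cost per correct answer" could look like, assuming you log a dollar cost and a pass/fail result for each attempt. The `Run` record, the `cost_per_correct` helper, and all the numbers are made up for illustration; how you decide an answer is "correct" is left open.

```python
from dataclasses import dataclass


@dataclass
class Run:
    cost_usd: float  # total spend for this attempt (hypothetical field)
    correct: bool    # did it pass whatever check you trust? (hypothetical field)


def cost_per_correct(runs):
    """Total spend divided by the number of correct answers.

    Failed attempts still count toward cost. Returns None if nothing was correct.
    """
    total = sum(r.cost_usd for r in runs)
    n_correct = sum(r.correct for r in runs)
    return total / n_correct if n_correct else None


# Made-up data: the "token saver" setup is cheaper per run but fails more often.
baseline = [Run(0.10, True), Run(0.10, True), Run(0.10, False), Run(0.10, True)]
token_saver = [Run(0.06, True), Run(0.06, False), Run(0.06, False), Run(0.06, False)]

print("baseline:   ", cost_per_correct(baseline))
print("token saver:", cost_per_correct(token_saver))
```

With these invented numbers the token-saving setup spends less per run but more per correct answer, which is the gap a tokens-saved metric alone would hide.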