michaelbuckbee • yesterday at 10:50 AM • 2 replies • view on HN

I built a simple (free) eval tool for my own use (GitHub Gists + model outputs) after not being able to find a suitable one on the market.

The market is splitting into three categories:

1. Longitudinal LLM observability tooling

Most eval startups have gone the route of being an observability platform for LLM inference. They want to sit in your stack and run the inference themselves so they can collect data on how it performs.

They collect things like how often a model returns JSON that's out of spec or values that aren't expected, as well as general timing and cost info.
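As a rough illustration of that kind of check, here is a minimal Python sketch that tallies how often raw model responses are invalid JSON, out of spec, or carry an unexpected value. The spec (`REQUIRED_FIELDS`, `ALLOWED_SENTIMENTS`) and the function names are made up for the example; this is not how any particular observability product does it.

```python
import json
from collections import Counter

# Hypothetical spec for illustration: required keys and their expected types.
REQUIRED_FIELDS = {"sentiment": str, "confidence": float}
# Hypothetical set of values the application is prepared to handle.
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def check_output(raw: str) -> str:
    """Classify one raw model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(data, dict):
        return "out_of_spec"
    # A missing key or a wrong type both count as out of spec.
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(key), expected_type):
            return "out_of_spec"
    if data["sentiment"] not in ALLOWED_SENTIMENTS:
        return "unexpected_value"
    return "ok"


def summarize(raw_outputs):
    """Return the share of responses that fall into each category."""
    if not raw_outputs:
        return {}
    counts = Counter(check_output(raw) for raw in raw_outputs)
    return {label: n / len(raw_outputs) for label, n in counts.items()}
```

Running `summarize` over a day's worth of logged responses gives the kind of failure-rate number these platforms chart over time.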

2. Safety Limiting / Pentesting

Say you're doing something in the medical field, or something that's sensitive in some other way, and you want to figure out which model gives the best outputs for your task without going off the rails.

3. Simple cost + performance + quality swapping

This is what my tool does: it lets you test whether you _really_ need to be running that frontier model in a loop across a million records, or whether you'd be better off with an older model or something else.
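To make the trade-off concrete, here is a back-of-the-envelope Python sketch that puts accuracy on a small labeled sample next to projected cost for a full run. It is not how evvl works internally; the `Candidate` class, the prices, and the token counts are all placeholders to swap for your own numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    usd_per_million_input: float   # placeholder price, check your provider
    usd_per_million_output: float  # placeholder price, check your provider
    outputs: List[str]             # this model's answers on the labeled sample


def accuracy(outputs: List[str], expected: List[str]) -> float:
    """Share of answers that match the label exactly (case-insensitive)."""
    matches = sum(
        o.strip().lower() == e.strip().lower() for o, e in zip(outputs, expected)
    )
    return matches / len(expected)


def projected_cost_usd(
    candidate: Candidate,
    avg_input_tokens: float,
    avg_output_tokens: float,
    records: int,
) -> float:
    """Estimate the cost of running every record through one model."""
    per_record = (
        avg_input_tokens * candidate.usd_per_million_input
        + avg_output_tokens * candidate.usd_per_million_output
    ) / 1_000_000
    return per_record * records


if __name__ == "__main__":
    expected = ["spam", "ham", "spam", "ham"]
    candidates = [
        Candidate("frontier-model", 10.0, 30.0, ["spam", "ham", "spam", "ham"]),
        Candidate("older-model", 0.5, 1.5, ["spam", "ham", "ham", "ham"]),
    ]
    for c in candidates:
        cost = projected_cost_usd(c, 500, 50, 1_000_000)
        print(f"{c.name}: accuracy={accuracy(c.outputs, expected):.2f}, cost=${cost:,.2f}")
```

Exact-match accuracy only works for classification-style tasks; for free-form output you would need a different quality measure.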

https://evvl.ai/

Example eval: https://giyd8stidy.evvl.io

Replies

gavinboston • yesterday at 1:07 PM

Cool project! I haven't seen that OpenRouter workflow before (sign into OpenRouter and it creates an API key that your app can use). That looks like an interesting pattern to investigate.

My company recently built a tool that is closer to your first category, but it's an API so it doesn't have the security (supply chain) concern of being embedded in your application.

https://endpointevaluator.com

It's built to help people manage the risk of LLMs changing underneath them and drifting from their designed behavior. Traditional deterministic testing probably won't be sufficient for apps that provide nondeterministic output, like a chatbot backed by an LLM.
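One common way to test nondeterministic output, sketched here in Python, is to sample the same prompt several times, check a property of each response, and compare the pass rate to a recorded baseline. This is an illustration of the general idea only, not a description of how endpointevaluator.com works; `pass_rate`, `has_drifted`, the tolerance, and the stub generator are all hypothetical.

```python
import itertools
from typing import Callable


def pass_rate(
    generate: Callable[[str], str],
    check: Callable[[str], bool],
    prompt: str,
    samples: int = 20,
) -> float:
    """Call the model `samples` times and return the share of responses that pass."""
    passed = sum(1 for _ in range(samples) if check(generate(prompt)))
    return passed / samples


def has_drifted(current: float, baseline: float, tolerance: float = 0.1) -> bool:
    """Flag drift when the pass rate falls more than `tolerance` below baseline."""
    return current < baseline - tolerance


def make_stub(responses):
    """Stand-in for a real LLM call: cycles through canned responses."""
    cycle = itertools.cycle(responses)
    return lambda prompt: next(cycle)


if __name__ == "__main__":
    generate = make_stub([
        "Refunds take 5 business days.",
        "Refunds take 5 business days.",
        "I can't help with that.",
        "Refunds take 5 business days.",
    ])
    rate = pass_rate(generate, lambda r: "5 business days" in r, "How long do refunds take?")
    print(rate, has_drifted(rate, baseline=0.95))
```

The assertion is on a rate rather than on any single response, which is what lets the test tolerate run-to-run variation.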

The point in the linked article about the challenge of selling developer tools to developers is a good one. I think the first reaction to coding agents is "let's build everything ourselves!" but the long tail of maintenance is still there and the pendulum will probably swing back to "let's stick to our knitting."

jimmypk • yesterday at 12:41 PM

[flagged]
