> Technical: I started this project when the first LLMs came out. I've built extensive internal evals to understand how LLMs are performing. The hallucinations at the time were simply too frequent to pass this data through to visitors. However, I recently re-ran my evals with Opus 4.5 and was very impressed. I'm running out of scenarios I can think of or find where LLMs are bad at interpreting data.
It's nice to see an AI-centric Show HN product that uses proper evals and cares about data quality.
How did you build the initial data set you're using for the evals? Bootstrapping a high-quality data set is one of the hardest parts of really knowing how an AI product is performing.