jmathai • last Tuesday at 9:30 PM • 2 replies

You may also be getting a worse result for higher cost.

For a medical use case, we tested multiple Anthropic and OpenAI models as well as MedGemma. We were pleasantly surprised when the LLM-as-judge scored gpt5-mini as the clear winner. I don't think I would have considered it for these specific use cases, since I had assumed higher reasoning was necessary.

We're still waiting on human evaluation to confirm the LLM judge was correct.
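
For anyone wanting to run the same check, here is a minimal sketch of how judge-versus-human agreement could be measured once the human labels arrive. It computes Cohen's kappa, a chance-corrected agreement score. The function name, the label values ("mini", "large"), and the six example picks are all made up for illustration; they are not our data or our pipeline.

```python
from collections import Counter


def cohens_kappa(judge, human):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    # Fraction of cases where both raters picked the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    expected = sum(judge_counts[k] * human_counts[k] for k in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical data: which model each rater preferred on six cases.
judge_picks = ["mini", "mini", "large", "mini", "large", "mini"]
human_picks = ["mini", "large", "large", "mini", "large", "mini"]
kappa = cohens_kappa(judge_picks, human_picks)
print(f"kappa = {kappa:.2f}")
```

A kappa near 1 would suggest the judge tracks the human raters closely; a value near 0 would mean the agreement is about what chance alone produces, in which case the judge's ranking shouldn't be trusted on its own.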

Replies

lorey • last Tuesday at 10:16 PM

That's interesting. Similarly, we found that for very simple tasks the older Haiku models are worth considering, as they're cheaper than the latest Haiku models and often perform equally well.

andy99 • last Tuesday at 9:52 PM

You obviously know what you’re looking for better than I do, but personally I’d want to see a narrative that made sense before accepting that a smaller model somehow just performs better, even if the benchmarks say so. There may be such an explanation, but it feels very dicey without one.
