Depends on what you’re doing. Using the smaller / cheaper LLMs will generally make it way more fragile. The article appears to focus on creating a benchmark dataset with real examples. For lots of applications, especially if you’re worried about people messing with it, about weird behavior on edge cases, about stability, you’d have to do a bunch of robustness testing as well, and bigger models will be better.
Another big problem is that it's hard to set objectives in many cases; for example, maybe your customer service chat still passes but comes across worse with a smaller model.
I'd be careful is all.
I'm consistently amazed at how much some individuals spend on LLMs.
I get a good amount of non-agentic use out of them, and pay literally less than $1/month for GLM-4.7 on deepinfra.
I can imagine my costs might rise to $20-ish/month if I used that model for agentic tasks... still a very far cry from the $1000-$1500 some spend.
I'd second this wholeheartedly
Since building a custom agent setup to replace Copilot, adopting/adjusting Claude Code prompts, and giving it basic tools, gemini-3-flash is my go-to model unless I know it's a big and involved task. The model is really good at 1/10 the cost of Pro, super fast by comparison, and some basic A/B testing shows little to no difference in output on the majority of tasks I use it for.
Cut all my subs, spend less money, don't get rate limited
I’m also collecting the data on my side with the hope of later using it to fine-tune a tiny model. Unsure whether it’ll work, but if I’m using APIs anyway I may as well gather it and try to bottle some of that magic of using bigger models.
I paid a total of 13 US dollars for all my LLM usage in about 3 years. Should I analyze my providers and see if there's room for improvement?
Wow, this was some slick long form sales work. I hope your SaaS goes well. Nice one!
I love the user experience for your product. You give a free demo with results within 5 minutes and then encourage the customer to "sign in" for more than 10 prompts.
Presumably that'll be some sort of funnel for a paid upload of prompts.
> it's the default: You have the API already
Sorry, this just makes no sense to start off with. What do you mean?
I do not disagree with the post, but I am surprised that a post that is basically explaining very basic dataset construction is so high up here. But I guess most people just read the headline?
Aren't you supposed to customize the prompts to the specific models?
This is just evaluation, not “benchmarking”. If you haven’t set up evaluation on something you’re putting into production, then what are you even doing?
Stop prompt engineering, put down the crayons. Statistical model outputs need to be evaluated.
> He's a non-technical founder building an AI-powered business.
It sounds like he's building some kind of AI support chatbot.
I despise these things.
You don't need a fancy UI to try the mini model first.
ah yes... nothing like using another nondeterministic black box of nonsense to judge / rate the output of another.. then charge others for it.. lol
The author of this post should benchmark his own blog for accessibility metrics; the text contrast is dreadful.
On the other hand, this would be interesting for measuring agents on coding tasks, but there's quite a lot of context to provide there; both input and output would be massive.
Anecdotal tip on LLM-as-judge scoring: skip the 1-10 scale and use boolean criteria instead, then weight manually, e.g.:

- Did it cite the 30-day return policy? Y/N
- Tone professional and empathetic? Y/N
- Offered clear next steps? Y/N

Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps

Why: it reduces the volatility of responses while still maintaining the creativity (temperature) needed for good intuition.
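A minimal sketch of what this can look like, assuming a hypothetical judge_yes_no() wrapper around whatever judge model you use (the function, criteria wording, and weights below are just illustrative, not from the article):

```python
# Sketch: boolean-criteria LLM-as-judge with manual weights.
# judge_yes_no() is a placeholder for a real judge-model call.

CRITERIA = {
    "accuracy":   ("Did the reply cite the 30-day return policy?", 0.5),
    "tone":       ("Is the tone professional and empathetic?", 0.3),
    "next_steps": ("Does the reply offer clear next steps?", 0.2),
}

def judge_yes_no(question: str, reply: str) -> bool:
    """Placeholder judge: ask the model to answer strictly 'Y' or 'N'
    about `reply`, then parse only the first character of its output."""
    raw = "Y"  # swap in a real API call here
    return raw.strip().upper().startswith("Y")

def score(reply: str) -> float:
    """Weighted sum of boolean criteria:
    0.5 * accuracy + 0.3 * tone + 0.2 * next_steps."""
    return sum(
        weight * judge_yes_no(question, reply)
        for question, weight in CRITERIA.values()
    )

if __name__ == "__main__":
    # 1.0 with the stubbed judge; real scores land between 0 and 1.
    print(score("You can return it within 30 days; here's how to start..."))
```

A side benefit of splitting it this way: each criterion is a separate Y/N call, so when a score drops you can see exactly which criterion regressed instead of staring at a fuzzy 6-vs-7.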