Depends on what you’re doing. Using the smaller / cheaper LLMs will generally make it way more fragile. The article appears to focus on creating a benchmark dataset with real examples. For lots of applications, especially if you’re worried about people messing with it, about weird behavior on edge cases, about stability, you’d have to do a bunch of robustness testing as well, and bigger models will be better.
Another big problem is that it's hard to set objectives in many cases; for example, maybe your customer service chat still passes but comes across worse with a smaller model.
I'd be careful is all.
I'm consistently amazed at how much some individuals spend on LLMs.
I get a good amount of non-agentic use out of them, and pay literally less than $1/month for GLM-4.7 on deepinfra.
I can imagine my costs might rise to $20-ish/month if I used that model for agentic tasks... still a very far cry from the $1000-$1500 some spend.
I'd second this wholeheartedly
Since building a custom agent setup to replace Copilot, adopting/adjusting Claude Code prompts, and giving it basic tools, gemini-3-flash is my go-to model unless I know it's a big and involved task. The model is really good at 1/10 the cost of Pro, super fast by comparison, and some basic A/B testing shows little to no difference in output on the majority of tasks I use it for.
Cut all my subs, spend less money, don't get rate limited
I’m also collecting the data on my side with the hope of later using it to fine-tune a tiny model. Unsure whether it’ll work, but if I’m using APIs anyway I may as well gather it and try to bottle some of that magic of using bigger models.
I paid a total of 13 US dollars for all my LLM usage in about 3 years. Should I analyze my providers and see if there's room for improvement?
Wow, this was some slick long form sales work. I hope your SaaS goes well. Nice one!
I love the user experience for your product. You give a free demo with results within 5 minutes and then encourage the customer to "sign in" for more than 10 prompts.
Presumably that'll be some sort of funnel for a paid upload of prompts.
> it's the default: You have the API already
Sorry, this just makes no sense to start off with. What do you mean?
I do not disagree with the post, but I am surprised that a post that is basically explaining very basic dataset construction is so high up here. But I guess most people just read the headline?
Aren't you supposed to customize the prompts to the specific models?
This is just evaluation, not “benchmarking”. If you haven’t set up evaluation on something you’re putting into production, then what are you even doing?
Stop prompt engineering, put down the crayons. Statistical model outputs need to be evaluated.
> He's a non-technical founder building an AI-powered business.
It sounds like he's building some kind of AI support chatbot.
I despise these things.
You don't need a fancy UI to try the mini model first.
ah yes... nothing like using another nondeterministic black box of nonsense to judge / rate the output of another.. then charge others for it.. lol
The author of this post should benchmark his own blog for accessibility metrics; the text contrast is dreadful.
On the other hand, this would be interesting for measuring agents on coding tasks, but there's quite a lot of context to provide there; both input and output would be massive.
Anecdotal tip on LLM-as-judge scoring: skip the 1-10 scale and use boolean criteria instead, then weight manually, e.g.:

- Did it cite the 30-day return policy? Y/N
- Tone professional and empathetic? Y/N
- Offered clear next steps? Y/N

Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps

Why: it reduces the volatility of responses while still maintaining the creativity (temperature) needed for good intuition.
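A minimal sketch of what this can look like, assuming a hypothetical judge_yes_no() wrapper around whatever judge model you use (the function, criteria wording, and weights below are just illustrative, not from the article):

```python
# Sketch: boolean-criteria LLM-as-judge with manual weights.
# judge_yes_no() is a placeholder for a real judge-model call.

CRITERIA = {
    "accuracy":   ("Did the reply cite the 30-day return policy?", 0.5),
    "tone":       ("Is the tone professional and empathetic?", 0.3),
    "next_steps": ("Does the reply offer clear next steps?", 0.2),
}

def judge_yes_no(question: str, reply: str) -> bool:
    """Placeholder judge: ask the model to answer strictly 'Y' or 'N'
    about `reply`, then parse only the first character of its output."""
    raw = "Y"  # swap in a real API call here
    return raw.strip().upper().startswith("Y")

def score(reply: str) -> float:
    """Weighted sum of boolean criteria:
    0.5 * accuracy + 0.3 * tone + 0.2 * next_steps."""
    return sum(
        weight * judge_yes_no(question, reply)
        for question, weight in CRITERIA.values()
    )

if __name__ == "__main__":
    # 1.0 with the stubbed judge; real scores land between 0 and 1.
    print(score("You can return it within 30 days; here's how to start..."))
```

A side benefit of splitting it this way: each criterion is a separate Y/N call, so when a score drops you can see exactly which criterion regressed instead of staring at a fuzzy 6-vs-7.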