My friend is an exec at a US software company, and they are preparing to lay off a few teams of programmers in their Eastern European locations and replace them with a small number of US programmers + AI. He said the US programmers with AI are much more productive and produce new features much faster.
I think this misses the forest for the trees. Working with ChatGPT is eerily similar to working with offshore Indian devs back in my enterprise days. Productive if guided explicitly, but if left to run wild there are lots of WTF moments.
LLMs are likely to replace outsourced devs because your employees that know the context can use LLMs to do what offshore devs did before.
A crucial factor tech industry folks tend to ignore is how much executives value predictable costs. Cloud migrations got away with this, but still had to argue fiercely, because 'the cloud' and its serverless tech had the potential to significantly decrease overall spend for unpredictable, bursty workloads.
The usual counter-argument is the operational burden, but human capital is also a relatively fixed cost. A dedicated team of 3-5 FTEs could probably handle inference ops for a F500 company.
Meanwhile, the capability delta is shrinking fast. We have more evidence that local open-source is viable with the release of DeepSeek v4, and the industry is only trending further in this direction. Especially as we rely more on test-time compute and task-specific harnesses rather than model size.
So, if you're an executive looking at a marginal but fixed operations cost, added flexibility, and a rapidly closing gap in capability, why wouldn't you just run open-source models on your own infrastructure to get those highly predictable costs? Plus, you decrease the risk of one of the frontier
I have really been trying to get local models to work. I have tried different harnesses, tooling, skills, prompts, etc. But when I compare Claude Code with Anthropic models or Codex with GPT 5.5 against Qwen, GLM, or Gemma in the same harnesses, the frontier models come out massively ahead. I am at the point where I just don't see the point of the non-frontier models; they waste more time than they save.
I'm finding that sound judgment, common sense, technical depth and breadth, and a feel for the UX are the skills that amplify agentic coding. Deep knowledge of the problem domain and time with the customer (or SMEs or end users) are what build these. Outsourcing this will never work; you can't put someone 12 hours ahead of the timezone you're serving in front of the customer.
Great article that reinforces my own opinion, with the clever twist of adding low-cost human labor into the equation. Nice.
I spent a month comparing Gemini Ultra plan to using much lower cost DeepSeek v4 with open source coding harnesses and, spoiler alert: I was happier using the much cheaper and more environmentally friendly open models: https://marklwatson.substack.com/p/my-evaluation-of-ai-agent...
I keep seeing this narrative that holds up DeepSeek as the example of OSS LLMs, but they are subsidizing a huge number of tokens at cost, and it is easy to understand why they are doing it if one is not lazy and thinks critically.
It's still far too costly and not effective to run local AI that can match what the frontier models offer, especially when the inference hardware is being heavily restricted due to geopolitical risks. I find claims about local LLMs somehow giving these frontier companies a run for their money especially doubtful in the long run.
Tokens are getting expensive because they are beginning to corner the market and will use that advantage to limit hardware distribution within and beyond the borders.
It's more likely that some workflows will see more local LLMs, but those will never be the ones that require frontier-model-level capability, and they won't beat the price that a lighter, smaller version of a frontier model will offer to capture that tail end.
I've been saying this for a couple months now since I got decent hardware and started using my local Qwen 3.6 exclusively. I have no doubt the future for individuals and medium-sized companies is local private AI.
I've been pretty happy sticking with codex 5.4 medium. I don't see a good case for switching to 5.5 at the cost of going through my token budget quicker.
There are misaligned incentives here between users just trying to get stuff done and AI companies competing on having the "smartest" model that passes benchmarks and continuously does some Nobel Peace Prize-winning stuff. It's mostly overkill for the more mundane stuff normal people actually do with them. It's nice to have the option when you need that. But defaulting to that is not economical and a bit unnecessary.
There's also a difference between smart models and bigger context windows. Most of the progress in the last year was simply the context windows getting big enough to fit all/most of the stuff needed to solve issues. Before then, you had to carefully manage the context to not run out of space and they wouldn't fit much more than small hobby projects.
With sub agents, the parent agent doesn't need to be a frontier model. It can delegate to smarter agents. And most stuff it delegates shouldn't need a frontier model. Wouldn't it be nice if it could decide on a case by case basis?
The walled gardens offered by OpenAI, Anthropic, and others currently default to one-size-fits-all "frontier" models. This is not sustainable. They should evolve toward using smaller, effective models most of the time, escalating either when the estimated complexity is high or when the small models fail. I'm guessing some open source based alternatives to these walled gardens are probably already heading that direction.
The irony here is that with a walled garden, these companies are selling a premium experience. But in the current market that boils down to burning billions of investor cash to keep the GPUs going without much hope of profitability. Eventually surviving companies are going to have to compete on quality, cost and margins. The smart approach would be to dynamically adapt token and context window sizes instead of blindly defaulting everything to the best possible. Don't boil the oceans for a simple email summary or a simple web UI. That stuff already worked well enough with models even a few years ago.
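To make the escalation idea concrete, here is a minimal sketch of what "cheap model first, escalate on high estimated complexity or on failure" could look like. Everything in it is an assumption for illustration: the tier names, the relative costs, the word-count complexity heuristic, and the stub "models" are all hypothetical, not how any vendor actually routes requests.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tier:
    name: str                             # hypothetical tier label
    cost_per_call: float                  # made-up relative cost, not real pricing
    max_complexity: float                 # highest estimated complexity this tier should attempt
    run: Callable[[str], Optional[str]]   # returns None when the model fails the task


def estimate_complexity(task: str) -> float:
    """Crude stand-in heuristic: longer prompts count as more complex (0.0 to 1.0)."""
    return min(len(task.split()) / 200.0, 1.0)


def route(task: str, tiers: list[Tier]) -> tuple[str, str, float]:
    """Try tiers cheapest-first; skip tiers below the estimated complexity, escalate on failure."""
    complexity = estimate_complexity(task)
    spent = 0.0
    for tier in sorted(tiers, key=lambda t: t.cost_per_call):
        if complexity > tier.max_complexity:
            continue  # estimated too hard for this tier, do not pay for an attempt
        spent += tier.cost_per_call
        answer = tier.run(task)
        if answer is not None:
            return tier.name, answer, spent
    raise RuntimeError("no tier could handle the task")


# Stub models for the sketch: the small one "fails" on anything mentioning a refactor.
def small_model(task: str) -> Optional[str]:
    return None if "refactor" in task else f"small: {task[:20]}"


def frontier_model(task: str) -> Optional[str]:
    return f"frontier: {task[:20]}"


TIERS = [
    Tier("small", 1.0, 0.3, small_model),
    Tier("frontier", 30.0, 1.0, frontier_model),
]
```

With these stubs, `route("summarize this email", TIERS)` stays on the small tier, while `route("refactor the auth module", TIERS)` pays for one failed small attempt and then escalates. A real router would need a far better complexity estimate and a real way to detect failure (tests, a verifier, user feedback), which is the hard part.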
> (Human + an almost frontier LLM) vs Frontier LLM
I'm curious, who/what is operating the frontier LLM in this scenario?
The rest of the article is equally incoherent.
Fwiw, the cost per answer, which is what ultimately matters, is going down. In a competitive market with oss and multiple frontier labs, it is hard to maintain a premium long-term.
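As a rough illustration of why cost per answer and price per token can diverge, here is a small sketch. It assumes failed attempts are simply retried, each attempt succeeding independently with the same probability, so the expected number of attempts is 1 / success rate. All prices, token counts, and success rates below are made-up numbers, not measurements of any model.

```python
def cost_per_answer(price_per_mtok: float, tokens_per_attempt: int, success_rate: float) -> float:
    """Expected cost of one successful answer, assuming independent retries until success."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = price_per_mtok * tokens_per_attempt / 1_000_000
    return cost_per_attempt / success_rate


# Hypothetical inputs: a cheap model that burns more tokens and fails more often,
# and an expensive model that is terser and more reliable.
cheap = cost_per_answer(price_per_mtok=1.0, tokens_per_attempt=40_000, success_rate=0.5)
pricey = cost_per_answer(price_per_mtok=30.0, tokens_per_attempt=20_000, success_rate=0.9)
```

With these invented inputs the 30x gap in token price shrinks to roughly 8x per answer. The sketch ignores the human time spent noticing and fixing failures, which can dominate in practice.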
The big question is how subsidies vs technology improvement will play out. As we saw with Uber, selling at a loss can happen for a very long time, and technology improves relentlessly.
For reference, we publish https://botsbench.com/ that shows time and cost per answer are going down while quality is going up.
For sure true for specialized ones like MedGemma (healthcare). In my testing, the 27b model is at least on the same level as frontier, and in some cases outperforms them. 4B is insanely good too for some lighter workloads. Thanks G for working on this!
I disagree with every part of this.
Local LLMs are great and very useful but if you are claiming that their code quality is in the same ballpark as Claude Code or Codex with their best models I cannot consider you a serious person. I feel like this is analogous to the folks arguing that The Cloud is "someone else's computer." As if billions of dollars of spend gives these companies zero benefit over a Mac mini.
Regarding offshore, at least in my experience, better coding agent output is down to two factors. The first is subject matter expertise. Providing the right context to the coding agent based on the tech you are building for is beyond critical. That's the issue with the vibe coded slop projects. No expertise in a technology means no awareness of gotchas. React is the most obvious example, because the LLM default is to useEffect endlessly.
The bigger issue is that by their very nature LLMs are very sensitive to quality prompting in English. I have seen offshore devs fail endlessly because they don't have the English skills to successfully prompt the machine. That has caused more work for my US based devs, who either have to carefully tune the work ticket so it is basically a coding agent prompt, or go through multi day exercises to enforce better prompting.
A single US dev with Claude Code is orders of magnitude better than typical offshore. Adding local models into the mix would make offshore completely useless. I'm sure many companies will see ballooning AI bills and expensive onshore devs and be very tempted to go to TCS or similar. I hope so, because that will give startups plenty of easy targets to disrupt.
$1100/m for an outsourced engineer… am I missing something? That’s far too low. Even juniors in South America tend to ask for at least double that number before factoring in the DeepSeek cost.
I think the biggest pull is yet to come. Legislation around sovereignty and the US CLOUD Act is something of a challenge for the US hyperscalers, so these local models may have more than just a price advantage against frontier labs. They may also have policy and lobbying on their side.
I've seen the $1000/mo engineer salary thrown around a bit and I'm not even sure where it comes from.
>frontier models are more capable than the latest from DeepSeek. But is the capability difference enough to justify a 30x price difference?
The contradiction here is that without frontier models, there'd be no foundation for models like DeepSeek to reference and catch up to. Is there an economic model that captures this kind of dynamic?
Always has been. People pay for the (not so) marginal performance gains.
Premium services need to allow enterprises to self host the services to reduce cost of inference. Another advantage is data doesn't leave the VPNs.
It's particularly funny to me, but a minor point, that this post requires me to go through some kind of cloudflare armed checkpoint to dare read about AI.
A bigger issue is that this thing calls AIs better coders than people, and I have tried for the past 4 months to get one of the several I looked into to consistently get a simple event-bus-backed Java monorepo going, with exactly zero success. Claude even repeatedly wanted to put my login logic at the actual event bus, for some reason.
What does "better coder" _exactly_ mean at this point?
The dark mode version of the site makes the tables unreadable.
I don't see local AI taking off. Memory costs make it impractical. DeepSeek API pricing is not a suitable analogue because it's not local.
> But is the capability difference enough [..]
This is the (m/b)illion dollar question, isn't it? I think there's also a question of what do you think capability is exactly, and how the difference manifests itself.
On the one hand, when something becomes "good enough" that's a clear capability threshold. On the other hand, what's the limit of those capabilities, and equally as important, how does capability reflect on reliability?
We've seen "local models" lately improve on capabilities where they're "good enough" for some tasks. Reliability of solving those tasks is a bit harder to measure/benchmark/test. It'll get better as more people work with those models. But, something I've noticed in the past ~6months is that the frontier models are gaining a lot in both the breadth of capabilities, as well as the reliability of solving those tasks that they're capable of solving. I think this is where scaling (both compute and data) is showing, and where having more compute is simply better (more parallel exploration, more training data output, more broad data, etc).
There's also the problem of benchmarking true capabilities. The popular ones are getting old, and aren't as reliable as they used to be (not even touching on the subject of benchmaxxing, just thinking about their saturation, even with honest intentions).
So the question then becomes what will users prefer? Do you get the best of the best, or the one that's good enough? There might be a market for both, honestly. Not everyone does SotA stuff. And a lot of what people used to do in a company is probably mundane enough that a "good enough" model with "good enough" reliability can probably handle (w/ some supervision ofc).
What I'm more interested in is whether things like Taalas succeed and get to provide local hardware that runs models "burned in silicon". That would be interesting, because speed and all the advantages of local models are a "quality" on their own. For example, right now I'd pay ~$1k for an external-HDD-sized block that can run a ~32B model that's popular right now, even knowing that it can only run that model. I have no idea if that's feasible or not, or if it makes sense from a financial pov. But I'd buy one. And local inference on dedicated chips doesn't need to be "oss only". I'm sure oAI / etc would probably take the risk of licensing one of their -mini / -lite models provided that the risk of the weights leaking is small enough (and it probably is).
> This keeps a ceiling on how much or how fast the frontier labs can raise prices.
I generally agree, but from a different perspective. Up till now we've seen that the 3 labs influence each other's price points. When gpt5 came out at a radically smaller price, the others lowered them as well. Now with opus being SotA for coding, w/ 5.5 close behind, they've raised them back. Google seems to follow slowly. But there's hope that, being 3 top labs + 2 trailing (xAI & Meta), there'll be pressure once again. If any of those trailing labs manage to get to SotA again, the prices will drop once more. Some people say that open source also provides a pressure here, but I'm not yet convinced of this. There's still a question of who'll serve the models, at what scales, etc.
The current closed source frontier models are more capable than the latest from DeepSeek. But is the capability difference enough to justify a 30x price difference?
"Frontier models" are caught in a financial dilemma of their own making --- they have spent such huge sums on development and as a result, they may have inadvertently priced themselves out of the market.
Energy costs are a huge factor for AI. He who has the lowest energy costs will likely be able to dictate market prices. And fossil fuel dependence doesn't look to be advantageous for AI.
I think this is a compelling argument, but I see 2 issues:
1. I remain unconvinced local AI can work well for the majority of businesses. It looks vaguely comparable on benchmarks, but in reality it tends to be fragile and carries a lot of management overhead.
2. Similarly, while DeepSeek is comparable to Opus/Codex on benchmarks, for agentic work at scale I definitely notice the difference. That's not to say it's not economical, just that I definitely miss the big boys when I swap.
I kind of wish this was true, because the UK would be in a great place to compete with the US. But somehow people are happy to pay 3x the salary for an engineer in SF.
Only if you don’t allow construction of local data centers
First fix your website's navbar and hero on mobile. They're broken, and it shows that you vibe coded slop!!!
This is bogus.
When discussing LLM pricing, people are missing the plot. The subscription token price is 10x-40x cheaper than API pricing. Your $90 Claude subscription gives you close to $1000 to $4000 in equivalent API token pricing.
The second issue is that the quality of the model “operator” makes a massive difference in the outcomes. Highly skilled senior devs who know how to prompt and have high agency will outperform teams of people who lack motivation and foundational skills.
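For anyone who wants to check that arithmetic, a quick sketch. It takes the 10x-40x multiplier and the $90 subscription price from the comment above at face value; neither is verified here, and the function name is just a label for this example.

```python
def api_equivalent_usd(subscription_usd: float, multiplier: float) -> float:
    """API-priced value of the tokens a subscription buys, given a claimed multiplier."""
    return subscription_usd * multiplier


def effective_discount(multiplier: float) -> float:
    """Fraction saved versus API pricing when tokens are `multiplier` times cheaper."""
    return 1 - 1 / multiplier


# Claimed range from the comment: subscription tokens are 10x to 40x cheaper than API tokens.
low = api_equivalent_usd(90, 10)    # 900.0
high = api_equivalent_usd(90, 40)   # 3600.0
```

So the claimed range works out to about $900 to $3600 of API-priced usage, in line with the rounded "$1000 to $4000" figure, and corresponds to an effective discount of 90% to 97.5% versus API pricing, assuming the subscriber actually uses the full allowance.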
Lastly, there is a massive difference in capabilities, determinism, and error handling between 5T SOTA models like Opus and tiny distillations from DeepSeek that perform well only in benchmarks.