Hacker News

jawon · yesterday at 9:41 PM

I was thinking about in-house model inference speeds at frontier labs like Anthropic and OpenAI after reading the "Claude built a C compiler" article.

Having higher inference speed would be an advantage, especially if you're trying to eat all the software and services.

Anthropic offering 2.5x makes me assume they have 5x or 10x themselves.

In the predicted nightmare future where everything happens via agents negotiating with agents, the side with the most compute, and the fastest compute, is going to steamroll everyone.


Replies

Aurornis · yesterday at 10:20 PM

> Anthropic offering 2.5x makes me assume they have 5x or 10x themselves.

They said the 2.5X offering is what they've been using internally. Now they're offering it via the API: https://x.com/claudeai/status/2020207322124132504

LLM APIs are tuned to batch many parallel requests together: overall token throughput is higher, but individual requests are processed more slowly.

The scaling curves aren't that extreme, though. I doubt they could tune the knobs to get individual requests coming through at 10X the normal rate.
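For intuition, here's a toy model of that trade-off. Every number is invented for illustration, and this is not Anthropic's serving stack; it just assumes a batch shares a fixed token budget (bandwidth-bound) while any single request is capped by compute/kernel limits:

    # Toy model of the batching trade-off. All numbers are invented;
    # real serving curves depend on hardware, kernels, and memory
    # bandwidth.
    def decode_rates(batch_size, total_budget=10_000.0, per_request_cap=200.0):
        """Return (per-request tok/s, aggregate tok/s) at a batch size.

        The batch shares a fixed token budget, but a single request
        can't exceed its own cap, so shrinking the batch has
        diminishing returns.
        """
        per_request = min(per_request_cap, total_budget / batch_size)
        return per_request, per_request * batch_size

    for bs in (1, 4, 16, 64, 256):
        per_req, aggregate = decode_rates(bs)
        print(f"batch={bs:>3}  per-request={per_req:6.1f} tok/s  "
              f"aggregate={aggregate:9.1f} tok/s")

In this sketch the per-request rate saturates at the cap once the batch is small enough, while aggregate throughput collapses; which is why turning the batching knob alone probably can't buy a 10X individual speedup.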

This likely comes from having some servers tuned for higher individual request throughput, at the expense of overall token throughput. It's possible that it's on some newer generation serving hardware, too.
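A minimal sketch of that idea, where the pool names, numbers, and routing rule are all hypothetical:

    # Hypothetical two-pool setup: one tuned for aggregate throughput,
    # one for per-request latency. All parameters are invented.
    from dataclasses import dataclass

    @dataclass
    class Pool:
        batch_size: int        # typical decode batch on this pool
        tokens_per_sec: float  # per-request decode rate at that batch
        price_multiplier: float

    POOLS = {
        "standard": Pool(batch_size=128, tokens_per_sec=60.0, price_multiplier=1.0),
        "fast": Pool(batch_size=16, tokens_per_sec=150.0, price_multiplier=2.5),
    }

    def route(tier):
        # Priced-up requests go to the latency-tuned (possibly
        # newer-hardware) pool; everything else stays on standard.
        return POOLS.get(tier, POOLS["standard"])

    pool = route("fast")
    print(f"~{pool.tokens_per_sec:.0f} tok/s at {pool.price_multiplier}x price")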

crowbahr · yesterday at 10:08 PM

Where on earth are you getting these numbers? Why would a SaaS company that is fighting for market dominance withhold 10x performance if they had it? Where are you getting 2.5x?

This is such bizarre magical thinking, borderline conspiratorial.

There is no reason to believe any of the big AI players are serving anything less than the best trade-off of stability and speed that they can possibly muster, especially when their cost ratios are so bad.

stavros · yesterday at 10:12 PM

This makes no sense. It's not like they have a "slow it down" knob; they're probably parallelizing your request so you get a 2.5x speedup at 10x the price.
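One real technique that spends extra parallel compute to buy per-request speed is speculative decoding (whether that's what's actually happening here is pure guesswork). A toy sketch of the control flow, with trivial stand-in "models":

    import random

    # Toy speculative decoding: a cheap draft model proposes k tokens,
    # the expensive target model checks all k positions at once (one
    # batched forward pass in a real system) and keeps the longest
    # agreeing prefix. Both "models" below are stand-ins.
    random.seed(0)
    VOCAB = ["the", "cat", "sat", "on", "a", "mat"]

    def draft_model(context):
        return random.choice(VOCAB)

    def target_model(context):
        # Deterministic stand-in for the big model's greedy choice.
        return VOCAB[(len(context) * 3) % len(VOCAB)]

    def speculative_step(context, k=4):
        drafts = []
        for _ in range(k):
            drafts.append(draft_model(context + drafts))
        accepted = []
        for i, tok in enumerate(drafts):
            expected = target_model(context + drafts[:i])
            if tok == expected:
                accepted.append(tok)       # draft agreed: token for free
            else:
                accepted.append(expected)  # mismatch: take target's token, stop
                break
        return accepted

    context = []
    for _ in range(6):
        context.extend(speculative_step(context))
    print(" ".join(context))

The per-token cost goes up (the target model scores draft positions that get thrown away on a mismatch) while wall-clock latency goes down when the draft is usually right, which roughly matches the "faster but pricier" shape of the offering.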

falloutx · yesterday at 10:09 PM

That's also called slowing down the default experience so users have to pay more for the fast mode. I think it's the first time we're seeing blatant speed ransoms in LLMs.
