This makes me wonder how large an FPGA-based system would have to be to do this. Obviously no single-chip FPGA can handle this kind of job, but I wonder how many we would need.
Also, what if Cerebras decided to make a wafer-sized FPGA array and turned large language models into lots and lots of logic gates?
It would be pretty incredible if they could host an embedding model on this same hardware; I would pay for that immediately. It would change the kinds of things you could build by enabling on-the-fly embeddings with negligible latency.
If they made a low-power/mobile version, this could be really huge for embedded electronics. Mass-produced, highly efficient, "good enough" but still sort of dumb AIs could put intelligence in household devices like toasters, light switches, and toilets. Truly we could be entering the golden age of curses.
Jarring to see these other comments so blindly positive.
Show me something at a model size of 80GB+, or this feels like "positive results in mice".
I totally buy the thesis on specialization here; it makes complete sense.
Aside from the obvious concern that this is a tiny 8B model, I'm also a bit skeptical of the power draw. 2.4 kW feels a little high; someone should do the napkin math on the throughput-to-power ratio compared with the H200 and other chips.
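A rough version of that napkin math in Python. The 2.4 kW figure is from above; the throughput numbers and the ~700 W H200 board power are placeholders I made up, so swap in measured values before drawing any conclusions:

    # Napkin math: throughput-to-power ratio in tokens/s per watt (higher is better).
    def tokens_per_watt(tokens_per_sec: float, watts: float) -> float:
        return tokens_per_sec / watts

    taalas_tps = 15_000   # hypothetical tok/s for the card (placeholder)
    taalas_watts = 2_400  # the 2.4 kW figure mentioned above

    h200_tps = 3_000      # hypothetical batched Llama-3.1-8B tok/s on one H200
    h200_watts = 700      # approximate H200 SXM board power

    print(f"Taalas: {tokens_per_watt(taalas_tps, taalas_watts):.1f} tok/s/W")
    print(f"H200:   {tokens_per_watt(h200_tps, h200_watts):.1f} tok/s/W")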
This is pretty wild! Only Llama 3.1 8B, but this is just their first release, so you can assume they're working on larger versions.
So what's the use case for an extremely fast small model? Structuring vast amounts of unstructured data, maybe? Put it in a little service droid so it doesn't need the cloud?
Inference is crazy fast! I can see a lot of potential for this kind of chip in IoT devices and robotics.
One step closer to being able to purchase a box of LLMs on AliExpress, though 1.7k tok/s would be quite enough.
But as models change rapidly and new architectures keep coming up, how do they scale? We also don't yet know whether the current transformer architecture will scale beyond what it already has. So many open questions, but VCs seem to be pouring money in.
I am super happy to see people working on hardware for local LLMs. Yet, isn't it premature? The space is still evolving. Today, I refuse to buy a GPU because I don't know what the best model will be tomorrow. I'm waiting for an off-the-shelf device that can run an Opus-like model.
Talks about ubiquitous AI but can't make a blog post readable for humans :/
Strange that they apparently raised $169M (really?) and the website looks like this. Don't get me wrong: plain HTML would do if done well, or you would expect something heavily designed. But script-kiddie vibe-coding seems off.
The idea is good though and could work.
I wonder if this is the first step towards AI as an appliance rather than a subscription?
So I'm guessing this is some kind of weights-as-ROM thing? At least that's how I interpret the product page; or maybe a sort of ROM that you can only access by doing matrix multiplies.
Not sure, but is this just ASICs for a particular model release?
That seems promising for applications that require raw speed. I wonder how much they can scale it up: a quantized 8B model is very usable but still quite small compared to even bottom-end cloud models.
Can it scale to an 800 billion param model? 8B parameter models are too far behind the frontier to be useful to me for SWE work.
Or is that the catch? Either way I am sure there will be some niche uses for it.
I was all praise for Cerebras, and now this! $30M for a PCIe card in hand really makes it approachable for many startups.
Fast, but the output is shit due to the constrained model they used. I doubt we'll ever get something like this for the large-parameter, decent models.
I know it's not easy to see the benefits of small models, but this is what I'm building for (1). I created a product for the Google Gemini 3 Hackathon and used Gemini 3 Flash (2). I tested locally using Ministral 3B and it was promising. It will definitely need work, but 8B/14B may give awesome results.
I am building data extraction software on top of emails, attachments, and cloud/local files. I use reverse template generation, with only the variable translation done by LLMs (3). Small models are awesome for this (4); a rough sketch of the idea is below the links.
I just applied for API access. If privacy policies are a fit, I would love to enable this for MVP launch.
1. https://github.com/brainless/dwata
2. https://youtu.be/Uhs6SK4rocU
3. https://github.com/brainless/dwata/tree/feature/reverse-temp...
4. https://github.com/brainless/dwata/tree/feature/reverse-temp...
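For anyone wondering what "reverse template + LLM variable translation" might look like, here is a toy sketch (not the actual dwata code; the template, field names, and llm_normalize stand-in are all made up). The template does the structural matching, and a small model is only asked to normalize the captured variables:

    import re

    # Toy template "reversed" from a known document layout.
    TEMPLATE = re.compile(
        r"Invoice (?P<invoice_id>\S+) from (?P<vendor>.+?) due (?P<due_date>.+)"
    )

    def llm_normalize(field: str, raw_value: str) -> str:
        # Stand-in for a call to a small local model (e.g. a 3B/8B),
        # used only to normalize values like "next Friday" -> an ISO date.
        return raw_value

    def extract(line: str) -> dict | None:
        match = TEMPLATE.search(line)
        if not match:
            return None
        return {k: llm_normalize(k, v) for k, v in match.groupdict().items()}

    print(extract("Invoice INV-042 from Acme Corp due next Friday"))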
Amazing speed. Imagine if it's standardised like GPU cards in the future.
New models come out, time to upgrade your AI card, etc.
Gemini 2.5 Flash Lite does 400 tokens/sec. Is there a benefit to going faster than a person can read?
This is impressive. If you can scale it to larger models, and somehow make the ROM writeable, wow, you win the game.
Wow. I’m finding it hard to even conceive of what it’d be like to have one of the frontier models on hardware at this speed.
Reminds me of when bitcoin started running on ASICs. This will always lag behind the state of the art, but incredibly fast, (presumably) power efficient LLMs will be great to see. I sincerely hope they opt for a path of selling products rather than cloud services in the long run, though.
What happened to Beff Jezos's AI chip?
This is like microcontrollers, but for AI? Awesome! I want one for my electric guitar; and please add an AI TTS module...
The token throughput improvements are impressive. This has direct implications for usage-based billing in AI products — faster inference means lower cost per request, which changes the economics of credits-based pricing models significantly.
ASIC inference is clearly the future just as ASIC bitcoin mining was
I still believe this is the right - and inevitable - path for AI, especially as I use more premium AI tooling and evaluate its utility (I’m still a societal doomer on it, but even I gotta admit its coding abilities are incredible to behold, albeit lacking in quality).
Everyone in Capital wants the perpetual rent-extraction model of API calls and subscription fees, which makes sense given how well it worked in the SaaS boom. However, as Taalas points out, new innovations often scale in consumption closer to the point of service rather than monopolized centers, and I expect AI to be no different. When it’s being used sparsely for odd prompts or agentically to produce larger outputs, having local (or near-local) inferencing is the inevitable end goal: if a model like Qwen or Llama can output something similar to Opus or Codex running on an affordable accelerator at home or in the office server, then why bother with the subscription fees or API bills? That compounds when technical folks (hi!) point out that any process done agentically can instead just be output as software for infinite repetition in lieu of subscriptions and maintained indefinitely by existing technical talent and the same accelerator you bought with CapEx, rather than a fleet of pricey AI seats with OpEx.
The big push seems to be building processes dependent upon recurring revenue streams, but I’m gradually seeing more and more folks work the slop machines for the output they want and then put it away or cancel their sub. I think Taalas - conceptually, anyway - is on to something.
There's a scifi story here when millions of these chips, with Qwen8-AGI-Thinking baked into them, are obsoleted by the release of Qwen9-ASI, which promptly destroys humanity and then itself by accident. A few thousand years later, some of the Qwen8 chips in landfill somehow power back up again and rebuild civilization on Earth.
Paging qntm...
> Though society seems poised to build a dystopian future defined by data centers and adjacent power plants, history hints at a different direction. Past technological revolutions often started with grotesque prototypes, only to be eclipsed by breakthroughs yielding more practical outcomes.
…for a privileged minority, yes, and to the detriment of billions of people whose names the history books conveniently forget. AI, like past technological revolutions, is a force multiplier for both productivity and exploitation.
What would it take to put Opus on a chip? Can it be done? What’s the minimum size?
"Many believe AI is the real deal. In narrow domains, it already surpasses human performance. Used well, it is an unprecedented amplifier of human ingenuity and productivity."
Sounds like people drinking the Kool-Aid now.
I don't reject that AI has use cases. But I do reject the way it's promoted as an "unprecedented amplifier" of human anything. These folks would even claim AI improves human creativity. Well, has that been the case?
I'm loving summarization of articles using their chatbot! Wow!
It was so fast that I didn't realise it had sent its response. Damn.
Does anyone have an idea how much such a component costs?
Is this hardware for sale? The site doesn't say.
It's amazingly fast, but since the model is quantized and pretty limited, I don't know what it's useful for.
I don't know why, but my ultra wide monitor absolutely hates that site. The whole screen is flickering trying to deal with the annoying background. Thank the gods for reader mode.
Jesus, it just generated a story in 0.039s.
Whoever doesn’t buy/replicate this in the next year is dead. Imagine OpenAI trying to sell you a platform that takes 15 minutes, when someone else can do it in 0.001s.
The future is these as small, swappable, SD-card-sized bits of hardware that you stick into your devices.
Imagine this thing for autocomplete.
I'm not sure how good llama 3.1 8b is for that, but it should work, right?
Autocomplete models don't have to be very big, but they gotta be fast.
Wow, this is great.
To the authors: do not self-deprecate your work. It is true this is not a frontier model (anymore) but the tech you've built is truly impressive. Very few hardware startups have a v1 as good as this one!
Also, for many tasks I can think of, you don't really need the best of the best of the best; cheap and instant inference is a major selling point in itself.
It's crazily fast. But an 8B model is pretty much useless.
Anyway, VCs will dump money on them, and we'll see if the approach can scale to bigger models soon.
write six seven as a number
> The number "six" is actually a noun, not a number. However, I assume you're asking to write the number 7 as a numeral, which is: 7
My concept was to do this with two pieces:
1. Generic mask layers and board to handle what's common across models, especially memory and interfaces.
2. Specific layers for the model implementation.
Masks are the most expensive part of ASIC design. So keeping the custom part small, with the rest pre-proven in silicon and even shared across companies, would drop costs significantly. This is already done in the hardware industry in many ways, but not for model acceleration.
Then, do 8B, 30-40B, 70B, and 405B models in hardware. Make sure they're RLHF-tuned well, since changes will be impossible or limited. Prompts will drive most useful functionality. Keep cranking out chips. There's maybe a chance to keep the weights changeable on-chip, but it should still be useful even if only the inputs can change.
The other concept is to use analog neural networks, with the analog layers on older, cheaper nodes. We only have to customize that part per model. The rest is pre-built digital with standard interfaces on a modern node. Given that the chips would be distributed, one might get away with 28nm for the shared part and develop it with shuttle runs.
This is really cool! I'm trying to find a way to accelerate LLM inference for PII detection, where speed is really necessary since we want to process millions of log lines per minute. I'm wondering how fast we could get, e.g., Llama 3.1 to run on a conventional NVIDIA card? 10k tokens per second would be fantastic, but even 1k would be very useful.
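In case it helps as a baseline while waiting for exotic hardware: a minimal batched-inference sketch with vLLM on a single NVIDIA card. The model name, prompt, and log line are illustrative, and actual tok/s depends heavily on batch size, quantization, and output length, so treat it as a starting point rather than a benchmark:

    from vllm import LLM, SamplingParams

    # Batch log lines through Llama 3.1 8B; vLLM handles the batching.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=2048)
    params = SamplingParams(temperature=0.0, max_tokens=64)

    log_lines = [
        "2024-01-01 10:00:00 user=jane.doe@example.com ip=10.0.0.12 login ok",
        # ... millions more, fed in large batches
    ]
    prompts = [
        f"List any PII (emails, names, IPs) in this log line, or say 'none':\n{line}"
        for line in log_lines
    ]

    outputs = llm.generate(prompts, params)
    for out in outputs:
        print(out.outputs[0].text.strip())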