What's happening in the comment section? How come so many people can't understand that this is running Llama 3.1 8B? Why are people judging its accuracy? It's an almost two-year-old 8B-param model; why would anyone expect an Opus-level response!?
The focus here should be on the custom hardware they are producing and its performance; that is what's impressive. Imagine putting GLM-5 on this, that'd be insane.
This reminds me a lot of when I tried the Mercury coder model by Inception Labs; they are building what they call a dLLM, a diffusion-based LLM. The speed is still impressive when playing around with it. But this, this is something else, it's almost unbelievable. As soon as I hit the enter key, the response appears; it feels instant.
I am also curious about Taalas' pricing.
> Taalas’ silicon Llama achieves 17K tokens/sec per user, nearly 10X faster than the current state of the art, while costing 20X less to build, and consuming 10X less power.
Do we have an idea of how much a unit / inference / api will cost?
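(On the speed half of that quote, a quick back-of-the-envelope of my own, not a figure from Taalas, on why the demo feels instant:)

    # Rough arithmetic from the quoted 17K tokens/sec per user.
    tokens_per_sec = 17_000
    per_token_ms = 1000 / tokens_per_sec    # ~0.06 ms per token
    answer_tokens = 500                     # a typical chat reply
    print(f"{per_token_ms:.3f} ms/token, {answer_tokens * per_token_ms:.0f} ms per answer")
    # -> ~0.059 ms/token, ~29 ms for a 500-token answer: well under anything
    #    a human perceives as latency.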
Also, considering how fast people switch models to keep up with the pace, is there really a market for hardware designed around a single model? What do you do when you want to upgrade to a better version? Throw away the current hardware and buy new? Shouldn't there be a more flexible way, maybe only swapping the chip on top, like how people upgrade CPUs? I don't know, just thinking out loud.
I personally don't buy it. Cerebras is way more advanced than this; comparing these tok/s numbers to Cerebras is disingenuous.
They don't give cost figures in their blog post but they do here:
https://www.nextplatform.com/wp-content/uploads/2026/02/taal...
Probably they don't know what the market will bear and want to do some exploratory pricing, hence the "contact us" API access form. That's fair enough. But they're claiming orders of magnitude cost reduction.
> Is there really a potential market for hardware designed for one model only?
I'm sure there is. Models are largely interchangeable, especially at the low end. There are lots of use cases where you don't need super smart models but cheapness and speed matter a lot.
Think about a simple use case: a company has a list of one million customer names but no information about gender or age, and they'd like a rough picture of both. Mapping a name to a guessed gender and a rough age is a simple problem for even dumb LLMs. I just tried it on ChatJimmy and it worked fine. For this kind of exploratory data problem you really benefit from mass parallelism, low cost and low latency.
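A minimal sketch of that kind of batch job, assuming an OpenAI-compatible chat endpoint; the URL, model name and key below are placeholders, not Taalas' actual API (they haven't published one):

    # Guess gender / rough age band for a list of names, fanned out in parallel.
    import json
    from concurrent.futures import ThreadPoolExecutor
    import requests

    BASE_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
    MODEL = "llama-3.1-8b"                                     # hypothetical model id
    HEADERS = {"Authorization": "Bearer YOUR_KEY"}

    def classify(name: str) -> dict:
        prompt = (f'For the first name "{name}", reply with JSON only: '
                  '{"gender": "m|f|unknown", "age_band": "e.g. 25-40"}')
        r = requests.post(BASE_URL, headers=HEADERS, json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        }, timeout=30)
        # Assumes the model cooperates and returns pure JSON.
        return json.loads(r.json()["choices"][0]["message"]["content"])

    names = ["Maria", "John", "Yusuf", "Sakura"]      # imagine a million of these
    with ThreadPoolExecutor(max_workers=64) as pool:  # cheap, fast calls -> fan out wide
        results = dict(zip(names, pool.map(classify, names)))

With an 8B-class model behind it, a million of these calls is a throughput and cost problem, not an intelligence problem, which is exactly where a fixed-model chip fits.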
> Shouldn't there be a more flexible way?
The whole point of their design is to sacrifice flexibility for speed, although they claim to support fine-tunes via LoRAs. LLMs are already supremely flexible, so it probably doesn't matter.
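The LoRA claim is at least plausible in principle: the big base weights can stay frozen (or even hardwired) while a fine-tune only swaps in two small matrices per layer. A toy numpy sketch of the math, my illustration, not anything Taalas has published:

    import numpy as np

    d, r = 4096, 16                    # hidden size, LoRA rank (r << d)
    W = np.random.randn(d, d)          # base weight: frozen / baked into silicon
    A = np.random.randn(r, d) * 0.01   # small, swappable adapter matrices
    B = np.zeros((d, r))               # zero-init, so the adapter starts as a no-op

    x = np.random.randn(d)
    y = W @ x + B @ (A @ x)            # LoRA adds a low-rank correction to Wx

Only A and B change per fine-tune, roughly 0.8% of the parameters at this size, so the hardwired W never needs re-fabbing.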