This is very cool to see - seems like soooo much efficiency waiting to be unlocked at the chip level.
What's everyone think of Taalas?
They're actually burning the LLM into the silicon, with some onboard memory for fine-tuning. They claim huge cost / latency wins.
Super fast demo live at: https://chatjimmy.ai/
https://www.reddit.com/r/singularity/comments/1r9frzk/taalas...
It'd be cool to see more of this type of thing, but I have to imagine there's limited ability to update a chip as brand-new models come out. If that is the case, it's going to be an extremely hard sell.
It seems technically interesting, but they seem very sparse on details. I don't know if I like the idea of a single unchanging model forever on a chip. How much more expensive would the silicon be if they used rewritable ROM for the weights? Such an arrangement would permit fine-tunes of the model it was designed for, which might minimize concerns about the model becoming outdated.
In a chatbot, 17k tok/s is a neat but nearly useless showcase. In a coding agent it is a meaningful improvement. In robotics, it could be an absolute revolution.
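For a rough sense of why the same speedup matters so differently across those cases, here's a back-of-the-envelope sketch. Only the 17k tok/s figure comes from this thread; the 100 tok/s baseline and the token counts per workload are numbers I made up for illustration, and it ignores prefill, network, and tool-call latency entirely.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to emit `tokens` at a fixed decode rate (decode only, nothing else)."""
    return tokens / tokens_per_second


FAST_TPS = 17_000    # figure quoted in the thread
BASELINE_TPS = 100   # ASSUMPTION: illustrative decode rate for general-purpose hardware

# ASSUMPTION: hypothetical workload sizes, chosen only to show the scaling.
WORKLOADS = {
    "chat reply": 500,         # one answer of ~500 tokens
    "agent run": 50 * 2_000,   # 50 steps of ~2,000 generated tokens each
}

for name, tokens in WORKLOADS.items():
    slow = generation_seconds(tokens, BASELINE_TPS)
    fast = generation_seconds(tokens, FAST_TPS)
    print(f"{name}: {slow:.1f}s -> {fast:.2f}s (saves {slow - fast:.1f}s)")
```

Under those assumptions the ratio is the same 170x either way, but the chat reply only saves about five seconds, while the agent run goes from roughly a quarter of an hour to a few seconds.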
8B models aren't useful in general, but for specific use cases they can provide an enormous amount of intelligence - NVIDIA's Tesla/Waymo competitor is a 7B LLM with a 2B diffusion model, and running that at those speeds could be an order of magnitude cheaper than existing solutions.
I think hardware like this is the future for LLM-providers once we reach a point where the models aren't advancing much any more. You could argue we're close now.
The hyperscalers like AWS will make great use of these to serve up models that will be relevant for several years. But right now, we're still seeing significant bumps in model quality every couple of months - especially with open-weight models like DeepSeek/Kimi/GLM.
Until that point, though, I don't see how this is ever going to be cost effective vs general purpose hardware.
I also think we'll see miniature versions of this baked into mobile hardware for super fast and efficient on-device LLMs.
> seems like soooo much efficiency waiting to be unlocked at the chip level
Well, if you are exclusively using general-purpose GPUs, of course you leave a lot of efficiency on the table. That’s why Google started making TPUs more than a decade ago. I remember the kerfuffle when Google fired Timnit Gebru: her paper used GPUs to calculate the environmental impact of LLMs while ignoring the efficiency of TPUs, and given how wide that efficiency gap is, this basically made Jeff Dean very angry.