Show HN: Three new Kitten TTS models – smallest less than 25MB

167 points • by rohan_joshi • today at 3:56 PM • 57 comments

Kitten TTS (https://github.com/KittenML/KittenTTS) is an open-source series of tiny and expressive text-to-speech models for on-device applications. We had a thread last year here: https://news.ycombinator.com/item?id=44807868.

Today we're releasing three new models with 80M, 40M and 14M parameters.

The largest model (80M) has the highest quality. The 14M variant reaches new SOTA in expressivity among similarly sized models, despite being <25MB in size. This release is a major upgrade from the previous one and supports English text-to-speech applications in eight voices: four male and four female.

Here's a short demo: https://www.youtube.com/watch?v=ge3u5qblqZA.

Most models are quantized to int8 + fp16, and they use ONNX for the runtime. Our models are designed to run anywhere, e.g. Raspberry Pi, low-end smartphones, wearables, browsers, etc. No GPU required! This release aims to bridge the gap between on-device and cloud models for TTS applications. A multilingual model release is coming soon.
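
As a rough sanity check on the size figures, the sketch below computes raw weight sizes from parameter counts alone. It assumes every weight is stored at a single precision and ignores graph metadata and voice embeddings; which tensors are actually kept at int8 versus fp16 is not stated in this post. A 14M-parameter model comes out at about 14 MB in pure int8 and 28 MB in pure fp16, so a mix landing under 25MB is plausible.

```python
# Back-of-the-envelope weight sizes. Assumption: one precision for all
# weights, no ONNX graph metadata, no voice embeddings.
BYTES_PER_PARAM = {"int8": 1, "fp16": 2, "fp32": 4}


def weight_size_mb(num_params: int, dtype: str) -> float:
    """Approximate size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for params in (14_000_000, 40_000_000, 80_000_000):
    sizes = {dtype: weight_size_mb(params, dtype) for dtype in BYTES_PER_PARAM}
    print(f"{params / 1e6:.0f}M params:", sizes)
```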

On-device AI is bottlenecked by one thing: a lack of tiny models that actually perform. Our goal is to open-source more models to run production-ready voice agents and apps entirely on-device.

We would love your feedback!

Comments

kevin42 • today at 4:40 PM

What I love about OpenClaw is that I was able to send it a message on Discord with just this github URL and it started sending me voice messages using it within a few minutes. It also gave me a bunch of different benchmarks and sample audio.

I'm impressed with the quality given the size. I don't love the voices, but it's not bad. Running on an Intel 9700 CPU, it's about 1.5x realtime using the 80M model. It wasn't any faster running on a 3080 GPU though.
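
For anyone wanting to reproduce a figure like this, here is a minimal sketch of one way to measure it, assuming "1.5x realtime" means 1.5 seconds of audio produced per second of wall-clock time. `fake_synthesize` and the 24 kHz sample rate are hypothetical stand-ins, not the Kitten TTS API; swap in the real synthesis call and the model's actual sample rate.

```python
import time
from typing import Callable, Sequence


def realtime_speedup(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int,
) -> float:
    """Return seconds of audio produced per second of wall-clock time."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return audio_seconds / elapsed


def fake_synthesize(text: str) -> list:
    """Hypothetical stand-in: 1 s of silence at 24 kHz after 0.1 s of 'work'."""
    time.sleep(0.1)
    return [0.0] * 24_000


print(f"{realtime_speedup(fake_synthesize, 'hello', 24_000):.1f}x realtime")
```

A value above 1.0 means synthesis finishes faster than the audio takes to play back.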

vezycash • today at 6:52 PM

Would an Android app of this be able to replace the built in tts?

armcat • today at 6:42 PM

This is awesome, well done. I've been doing a lot of work with voice assistants; if you can replicate Qwen3-TTS's voice cloning in this small a form factor, you will be absolute legends!

pumanoir • today at 6:39 PM

The example.py file says "it will run blazing fast on any GPU. But this example will run on CPU."

I couldn't locate how to run it on a GPU anywhere in the repo.
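
Since the models use ONNX, GPU execution would normally come down to ONNX Runtime's execution providers. The sketch below shows that generic mechanism only. Whether the KittenTTS package exposes a way to pass providers through is not confirmed anywhere in this thread, and `model_path` is a hypothetical path to an exported `.onnx` file. The CUDA provider also requires the `onnxruntime-gpu` build rather than the CPU-only `onnxruntime` package.

```python
from typing import List


def choose_providers(available: List[str]) -> List[str]:
    """Prefer CUDA when ONNX Runtime reports it; keep CPU as the fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]


def open_session(model_path: str):
    """Create an ONNX Runtime session on the best available provider."""
    # Imported lazily so choose_providers works without onnxruntime installed.
    import onnxruntime as ort

    providers = choose_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

Calling `session.get_providers()` on the returned session shows which provider was actually selected.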

ks2048 • today at 4:46 PM

You should put examples comparing the 4 models you released - same text spoken by each.

magicalhippo • today at 5:31 PM

A lot of good small TTS models in recent times. Most seem to struggle hard on prosody though.

Kokoro TTS, for example, has a very good Norwegian voice, but the rhythm and emphasis are often so out of whack that the generated speech is almost incomprehensible.

Haven't had time to check this model out yet, how does it fare here? What's needed to improve the models in this area now that the voice part is more or less solved?

gabrielcsapo • today at 7:06 PM

Are there plans to output text alignment?

altruios • today at 4:39 PM

One of the core features I look for is expressive control.

Either in the API, via pitch/speed/volume parameters, for more deterministic control.

Or in expressive tags such as [coughs], [urgently], or [laughs in melodic ascending and descending arpeggiated gibberish babbles].

The 25MB model is amazingly good for being 25MB. How does it handle expressive tags?
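
To make the tag idea concrete, here is a minimal sketch of how a front end could separate bracketed tags from the text to be spoken before handing anything to a model. This is purely illustrative: nothing in this thread says Kitten TTS interprets such tags, and `split_tags` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# Matches a bracketed tag such as [coughs]; nested brackets are not handled.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tags(text: str) -> List[Tuple[str, str]]:
    """Split text into ordered ("text", ...) and ("tag", ...) segments."""
    segments = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        plain = text[pos:match.start()].strip()
        if plain:
            segments.append(("text", plain))
        segments.append(("tag", match.group(1).strip()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


print(split_tags("Hello [coughs] sorry, [urgently] run!"))
```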

schopra909 • today at 6:33 PM

Really cool to see innovation in terms of quality of tiny models. Great work!

Remi_Etien • today at 5:49 PM

25MB is impressive. What's the tradeoff vs the 80M model — is it mainly voice quality or does it also affect pronunciation accuracy on less common words?

DavidTompkins • today at 5:48 PM

This would be great as a JS package. 25MB is small enough that I think it'd be worth it (in-browser TTS is still pretty bad and varies by browser).

janice1999 • today at 6:41 PM

What's the actual install size for a working example? Like similar "tiny" projects, do these models actually require installing 1GB+ of dependencies?
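
One way to answer this empirically is to install the package into a fresh virtual environment and total up the files on disk. The helper below is a generic sketch, not something taken from the repo; the path you pass in (for example, the environment's `site-packages` directory) is up to you.

```python
from pathlib import Path


def dir_size_mb(root: str) -> float:
    """Total size of all regular files under root, in megabytes (10**6 bytes)."""
    total = sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())
    return total / 1e6
```

Comparing the result before and after installation gives the footprint of the dependencies, separate from the model weights themselves.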

whitepaper27 • today at 6:59 PM

This is great. Demo looks awesome.

ks2048 • today at 4:52 PM

There's a number of recent, good quality, small TTS models.

If the author doesn't describe some detail about the data, training, or a novel architecture, etc., I can only assume they just took another one, did a little finetuning, and repackaged it as a new product.

sschueller • today at 6:19 PM

I'm still looking for the "perfect" setup to clone my voice and use it locally to send voice replies in Telegram via OpenClaw. Does anyone have such a setup?

I want to be my own personal assistant...

EDIT: I can provide it an RTX 3080 Ti.

devinprater • today at 5:38 PM

A lot of these models struggle with small text strings, like "next button" that screen readers are going to speak a lot.

fwsgonzo • today at 4:51 PM

How much work would it be to use the C++ ONNX runtime with this instead of Python? Is it a Claudeable amount of work?

The iOS version is Swift-based.

great_psy • today at 4:33 PM

Thanks for working on this!

Is there any way to get those running on iPhone? I would love to have the ability for it to read articles to me like a podcast.

ilaksh • today at 4:35 PM

Thanks for open sourcing this.

Is there any way to do a custom voice as a DIY? Or do we need to go through you? If so, would you consider making a pricing page for purchasing a license/alternative voice? All but one of the voices are unusable in a business context.

Tacite • today at 4:36 PM

Is it English only?

wiradikusuma • today at 5:31 PM

I'm thinking of giving a "voice" to my virtual pets (think Pokemon, but fewer than a dozen). The pets are made-up animals based on real animals, like Mouseier from Mouse (something like that). Is this possible?

TL;DR: generate a human-like voice based on an animal sound. Anyway, maybe it doesn't make sense.
