can you share your use cases for 2b and 4b models?
curious how people are leveraging these models
For me, I use them for quick autocomplete or small questions. I am not a vibe/agentic coder. I know I am a relic and a Luddite because of this.
Instead of hitting Stack Overflow and Google, I will ask questions like "can you give me an example of how to do x in library y?", or "this error is appearing; what might be happening, given I checked a, b, and c?", or "please write unit tests for this function". Or code autocomplete.
I am not looking for the world's best answer from a 3B model. I am looking for a super fast answer that reminds me of things I already know, or maybe, just maybe, gives me a quick idea to stub something out while I focus on something more important; I'm going to refactor anyway. Think of it as a low-quality rubber duck.
I mostly use 7-9B models for this now, but Llama 3.2 3B is pretty decent when I don't want it hogging resources, say while other compute-heavy operations are running on a weak computer.
Probably half the questions people ask ChatGPT could get roughly the same quality of answer from a small model, in my opinion. You can't fully trust an LLM anyway, so the difference between 60% and 70% accuracy isn't as big as marketing makes it sound. That said, the quality of a good 7-9B model is worth it compared to a 3B if your machine can run it. Furthermore, the quality of qwen 36 is crazy and makes me wonder whether I will ever need an AI provider again if the trend continues.
Over the weekend I used the small models for experimental training runs when figuring out how to build LoRAs. It takes a lot less time to do smoke tests of the process on E2B vs the 31B version. And E4B was a reasonable stop along the line just to make sure the LoRA combined with the base model to produce coherent output.
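The actual training runs need GPU libraries, but the "make sure the merged output is coherent" step can be automated cheaply. Here's a minimal sketch of the kind of coherence smoke check I mean; the function names, thresholds, and the repeated-4-gram heuristic are all my own assumptions, not anything from a LoRA toolkit:

```python
def repetition_ratio(text: str) -> float:
    """Fraction of repeated 4-word windows in the text.
    Degenerate LoRA merges often get stuck looping the same phrase."""
    words = text.split()
    grams = [tuple(words[i:i + 4]) for i in range(max(0, len(words) - 3))]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def looks_coherent(text: str, min_words: int = 20, max_rep: float = 0.3) -> bool:
    """Cheap smoke test: output is long enough and not stuck in a loop.
    Thresholds are arbitrary starting points; tune them for your model."""
    return len(text.split()) >= min_words and repetition_ratio(text) <= max_rep
```

You'd run a handful of fixed prompts through the merged model after each experimental run and flag any generation where `looks_coherent` comes back False, rather than eyeballing every output.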
Also, they're good enough for a lot of simple categorization and data extraction tasks, e.g. something like "flag abusive posts/comments", or "visit website, find the contact info, open hours, address". And they run fast on the kind of hardware you're likely to have at home, while the bigger dense versions decidedly do not.
I used Gemma 4 itself to review and prune the data (my social media posts over the last ~5 years, about 5 million words) being ingested into the training process for a LoRA for Gemma 4. I found the bigger model (31B) was more nuanced and useful than the smaller ones, and I wasn't in a big hurry by that stage of the process, so I used the big one overnight. Gemma 4 31B was also a better judge of my writing than Gemini Flash 2.5, by my reckoning.
It was, again, more nuanced, and was able to recognize a generally helpful comment that opened kinda jokey/rude, while the smaller model and Gemini 2.5 Flash tended to gravitate toward the extremes (1 or 5) rather than using the full 1-5 scale they were prompted to rate on. I assume Gemini 3.1 Flash is probably competitive or better, but I didn't try it, since I liked the results the self-hosted Gemma 4 was giving me for free.
The little ones also run great on very modest hardware. Both run at comfortable interactive speed on mid-range tablets. E4B is blazing fast on an iPad M4 or a Pixel 10 Pro, and entirely usable on a midrange Android phone with sufficient RAM.