So the "E2B" and "E4B" models are actually 5B and 8B parameters. Are we really going to start referring to the "effective" parameter count of dense models by not including the embeddings?
These models are impressive, but this is incredibly misleading. You still need to load the embeddings into memory along with the rest of the model, so it makes no sense to exclude them from the parameter count. This is why it actually takes 5GB of RAM to run the "2B" model with 4-bit quantization, according to Unsloth (when I first saw that number I knew something was up).
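To see why the embedding table matters, here is a rough back-of-the-envelope sketch (the vocabulary size, hidden dimension, and non-embedding count below are illustrative assumptions, not the real Gemma 3n config):

```python
# Hypothetical sizes for illustration only.
vocab_size = 256_000          # assumed vocabulary size
d_model = 2048                # assumed hidden dimension
embedding_params = vocab_size * d_model      # embedding table alone
other_params = 4_500_000_000                 # assumed non-embedding weights
total_params = embedding_params + other_params

print(f"embeddings: {embedding_params / 1e9:.2f}B")   # ~0.52B
print(f"total:      {total_params / 1e9:.2f}B")       # ~5.02B
```

With a large vocabulary, the embedding table alone is a sizable fraction of a small model, so whether you count it changes the headline number substantially.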
These are based on the Gemma 3n architecture, so E2B only needs about 2GB for text-to-text generation:
https://ai.google.dev/gemma/docs/gemma-3n#parameters
You can think of the per-layer embeddings as a vector database, so in theory you can serve them directly from disk.
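A minimal sketch of that idea, using a memory-mapped file so embedding rows are paged in from disk on demand (the file name, table shape, and token ids here are all made up for illustration):

```python
import numpy as np

vocab_size, d_model = 1000, 64   # hypothetical table shape

# Write a dummy table to disk once (stand-in for the real weights file).
table = np.arange(vocab_size * d_model, dtype=np.float32)
table = table.reshape(vocab_size, d_model)
table.tofile("embeddings.bin")

# Memory-map the file: rows are read from disk only when accessed,
# so the full table never has to be resident in RAM at once.
emb = np.memmap("embeddings.bin", dtype=np.float32, mode="r",
                shape=(vocab_size, d_model))

token_ids = [3, 17, 42]     # hypothetical token ids
vectors = emb[token_ids]    # only these rows are actually touched
print(vectors.shape)        # (3, 64)
```

This is the same trick a disk-backed vector database uses: the lookup key (the token id) selects a handful of rows, so per-token cost stays small even if the table itself is large.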