Gemma 4 12B: A unified, encoder-free multimodal model

218 points • by rvz • today at 4:04 PM • 85 comments

Comments

The big story here is the encoder-free part, which I still don't fully understand.

> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.

That's technically still encoding, just without a dedicated model like SigLIP for it? The Developer's Guide elaborates: it's still a 35M layer, and I am curious whether that is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...
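For what it's worth, here is how I read the quoted sentence, as a minimal sketch and not Google's actual implementation. The patch size, model width, the choice of RMS normalization, and the order of the steps are all my assumptions; the announcement only names a single matrix multiplication, a positional embedding, and normalizations.

```python
import math
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches

def rms_norm(vec, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (no learned gain, for brevity)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def embed_image(image, proj, pos, patch):
    """Patches -> one matrix multiplication -> add positional embedding -> normalize."""
    tokens = []
    for i, p in enumerate(patchify(image, patch)):
        # The single linear projection: (patch * patch) values -> d_model values.
        projected = [sum(p[k] * proj[k][j] for k in range(len(p)))
                     for j in range(len(proj[0]))]
        # Assumed order: positional embedding first, then normalization.
        with_pos = [a + b for a, b in zip(projected, pos[i])]
        tokens.append(rms_norm(with_pos))
    return tokens

# Toy sizes, all hypothetical: 8x8 image, 4x4 patches, model width 6.
random.seed(0)
PATCH, D_MODEL = 4, 6
image = [[random.random() for _ in range(8)] for _ in range(8)]
proj = [[random.gauss(0, 0.1) for _ in range(D_MODEL)] for _ in range(PATCH * PATCH)]
pos = [[random.gauss(0, 0.1) for _ in range(D_MODEL)] for _ in range(4)]

tokens = embed_image(image, proj, pos, PATCH)
print(len(tokens), len(tokens[0]))  # 4 patch tokens, each of width 6
```

The point of the sketch is that nothing here has attention or depth of its own: each patch becomes one token through a linear map, and any real visual understanding would have to happen inside the LLM backbone.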

> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.

I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.
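The back-of-the-envelope arithmetic behind that assumption, as a rough sketch: I am taking "12B" to mean exactly 12e9 parameters and counting weights only, so this ignores the KV cache, activations, and any per-block overhead a real quantization format adds.

```python
def weight_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 12_000_000_000  # assumption: "12B" taken as exactly 12e9 parameters

for name, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_gb(N_PARAMS, bits):.1f} GB")
# bf16: 24.0 GB
# int8: 12.0 GB
# 4-bit: 6.0 GB
```

Under these assumptions, unquantized bf16 weights alone would not fit in 16GB, while 8-bit would fit with little room to spare and 4-bit would fit comfortably.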

ethanpil • today at 4:36 PM

What's Google's business case for releasing open models? Don't get me wrong, I am grateful and appreciative of these releases. I'm trying to understand how it fits into their bigger picture as a for-profit company. Are they not helping competitors build on the novel technology they have developed?

Is it simply goodwill and/or marketing? Or am I missing something strategic?

spott • today at 5:35 PM

Is there a paper on this?

I'm curious how they pre-trained it... I feel like it must have had audio/image output that they chopped off.

I wonder how hard it would be to add it back on.

mlmonkey • today at 5:34 PM

Is there some place where we can try it before downloading the gigabytes of weights?

ComputerGuru • today at 5:08 PM

Quite aside from the architectural changes, I suppose this is the answer to why Google had such a glaring hole in the (pretrained) Gemma4 model lineup between the Gemma4 4b and Gemma4 26b models!

A model that comfortably fits in 16GB of VRAM (allowing room for context) is a welcome upgrade.

lxgr • today at 5:10 PM

Am I missing something or are the Ollama versions of this (https://ollama.com/library/gemma4/tags) text-only for now?

Havoc • today at 5:05 PM

Quite a niche release. The MoE outperforms it on benchmark scores and will likely be faster thanks to its lower active weights. So this really only makes sense for specific RAM-constrained applications that can't fit a quantized MoE.

Zambyte • today at 4:49 PM

Is this Mac only? Or is that an Ollama issue that it only supports this release of models on Mac? It seems like every tag with the MLX badge is only supported on Mac[0], and that includes all of the tags in this release.

[0] https://ollama.com/library/gemma4/tags

Edit: MLX being Mac-only is independent of the model being MLX (and therefore Mac) only. The latter is what I am asking about.

dwa3592 • today at 4:46 PM

This is a pretty good update. The demo video is a bit funny, though: the tester asks the model to turn the release into bullet points, and the model obliges. Then the tester asks it to draft an email with this content. BAM! The LLM turns the bullets back into passages even though it was not asked to, undoing the last good thing it did. I am not sure whether it's email etiquette to avoid bullets in an email.

randomNumber7 • today at 4:49 PM

> Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.

I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog).

djyde • today at 4:52 PM

What are the use cases for these small models? Is there anyone using models of this scale in their daily life who could share their experience?

nickandbro • today at 4:27 PM

Wow, Google is becoming the new pre-Llama 4 Meta when it comes to releasing open-weights models.

BiraIgnacio • today at 5:09 PM

Using an embedder instead of an encoder is quite clever. Not sure who came up with that first, but it's a cool idea.

digdugdirk • today at 5:04 PM

I do enjoy the immediate out-of-touch signaling with the "runs on your 16GB VRAM laptop" line. Because everyone has a laptop with 16GB of VRAM, or can just pop out and buy a new one, right?

claysmithr • today at 5:09 PM

I don’t see the download in LM Studio.

zuminator • today at 4:44 PM

How does it compare with e4b, aside from being larger?

jdelman • today at 4:55 PM

I can’t help but wonder if this is the basis of the model they’ve helped tune for Apple.
