kamranjon • yesterday at 1:15 AM • 1 reply

You may not have seen this part: "Tied embeddings: We now tie the embeddings between the encoder and decoder. This significantly reduces the overall parameter count, allowing us to pack more active capabilities into the same memory footprint — crucial for our new compact 270M-270M model."

Replies

killerstorm • yesterday at 2:09 AM

I have seen this part. In fact, I checked the paper itself, where they provide more detailed numbers: it's still almost double the size of the base Gemma. Reusing embeddings and attention doesn't make that much difference, as most weights are in the MLPs.
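
For anyone who wants to check this kind of claim themselves, here is a rough back-of-the-envelope sketch of the weight accounting. Everything in it is an assumption for illustration: the function name `count_params`, the gated-MLP layout (three projections), the omission of cross-attention, biases, and norms, and above all the dimensions, which are made up and are not the figures from the paper. Plug in the real numbers from the paper to see where the weights actually sit.

```python
def count_params(vocab, d_model, d_ff, n_layers, tied):
    """Rough weight count for an encoder-decoder model.

    Simplifications (assumptions, not the paper's architecture):
    biases, norms, and cross-attention are ignored, and the MLP is
    assumed to be gated (gate, up, and down projections).
    """
    embed = vocab * d_model        # one embedding table
    attn = 4 * d_model * d_model   # Q, K, V, and output projections per layer
    mlp = 3 * d_model * d_ff       # gate, up, and down projections per layer
    tables = 1 if tied else 2      # tied: encoder and decoder share one table
    return {
        "embeddings": tables * embed,
        "attention": 2 * n_layers * attn,  # encoder stack + decoder stack
        "mlp": 2 * n_layers * mlp,         # encoder stack + decoder stack
    }


# Hypothetical dimensions, chosen only to make the arithmetic easy to follow.
dims = dict(vocab=32_000, d_model=1024, d_ff=4096, n_layers=12)

untied = count_params(**dims, tied=False)
tied = count_params(**dims, tied=True)

saved = sum(untied.values()) - sum(tied.values())
print(f"untied total: {sum(untied.values()):,}")
print(f"tied total:   {sum(tied.values()):,}")
print(f"saved by tying: {saved:,} ({saved / sum(untied.values()):.1%})")
print(f"MLP share (tied): {tied['mlp'] / sum(tied.values()):.1%})")
```

Tying saves exactly one embedding table (`vocab * d_model` weights), so how much it matters depends on how large the vocabulary is relative to the model width and depth. With a very large vocabulary and a narrow model the embedding share grows quickly, so the result is sensitive to the dimensions you plug in.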
