Hacker News

killerstorm · yesterday at 12:45 AM · 2 replies

They are comparing a 1B Gemma to a 1B+1B T5Gemma 2. Obviously a model with twice as many parameters can do better. That says absolutely nothing about the benefits of the architecture.


Replies

yorwba · yesterday at 5:56 AM

Since the encoder weights are only used for the prefix context, and the decoder weights then take over for generation, the compute requirements should be roughly the same as for the decoder-only model. Obviously an architecture that can make use of twice the parameters in the same time is better. They should've put some throughput measurements in the paper, though...
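
A minimal sketch of that inference pattern, assuming a Hugging Face seq2seq interface (the checkpoint id below is hypothetical, just for illustration): the encoder forward pass happens once over the prompt, and each generated token only runs the decoder, which cross-attends to the cached encoder states.

    # Sketch only: the checkpoint id is made up for illustration.
    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    name = "google/t5gemma-2-270m-270m"  # hypothetical id
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    inputs = tokenizer("Summarize: ...", return_tensors="pt")

    with torch.no_grad():
        # One encoder pass over the whole prompt...
        encoder_outputs = model.get_encoder()(**inputs)
        # ...then only the decoder weights run per generated token,
        # attending to the cached encoder states via cross-attention.
        out = model.generate(
            encoder_outputs=encoder_outputs,
            attention_mask=inputs["attention_mask"],
            max_new_tokens=32,
        )
    print(tokenizer.decode(out[0], skip_special_tokens=True))

So total parameters are roughly 2x, but per-token decode cost matches a similarly sized decoder-only model.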

kamranjon · yesterday at 1:15 AM

You may not have seen this part: "Tied embeddings: We now tie the embeddings between the encoder and decoder. This significantly reduces the overall parameter count, allowing us to pack more active capabilities into the same memory footprint — crucial for our new compact 270M-270M model."
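
For a rough sense of what tying buys, here's a toy parameter count (the sizes are illustrative guesses, not numbers from the paper):

    import torch.nn as nn

    vocab_size, d_model = 256_000, 640  # illustrative sizes, not from the paper

    # One embedding table shared by encoder and decoder:
    shared = nn.Embedding(vocab_size, d_model)

    untied = 2 * vocab_size * d_model  # separate encoder + decoder tables
    tied = vocab_size * d_model        # single shared table
    print(f"untied: {untied:,} params")  # 327,680,000
    print(f"tied:   {tied:,} params")    # 163,840,000

At sub-billion scale the embedding tables dominate the parameter budget, so sharing one table (and often the output head as well) is presumably what lets a 270M-270M configuration fit in the same memory footprint.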
