Hacker News

killerstorm · yesterday at 12:45 AM · 2 replies

They are comparing a 1B Gemma to a 1B+1B T5Gemma 2. Obviously a model with twice as many parameters can do better. That says absolutely nothing about the benefits of the architecture.


Replies

yorwba · yesterday at 5:56 AM

Since the encoder weights are only used for the prefix context, and the decoder weights then take over for generation, the compute requirements should be roughly the same as for the decoder-only model. Obviously an architecture that can make use of twice the parameters in the same time is better. They should've put some throughput measurements in the paper, though...
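
A minimal sketch of that inference pattern, assuming a Hugging Face seq2seq interface (the checkpoint id below is hypothetical, just for illustration): the encoder forward pass happens once over the prompt, and each generated token only runs the decoder, which cross-attends to the cached encoder states.

    # Sketch only: the checkpoint id is made up for illustration.
    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    name = "google/t5gemma-2-270m-270m"  # hypothetical id
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    inputs = tokenizer("Summarize: ...", return_tensors="pt")

    with torch.no_grad():
        # One encoder pass over the whole prompt...
        encoder_outputs = model.get_encoder()(**inputs)
        # ...then only the decoder weights run per generated token,
        # attending to the cached encoder states via cross-attention.
        out = model.generate(
            encoder_outputs=encoder_outputs,
            attention_mask=inputs["attention_mask"],
            max_new_tokens=32,
        )
    print(tokenizer.decode(out[0], skip_special_tokens=True))

So total parameters are roughly 2x, but per-token decode cost matches a similarly sized decoder-only model.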

kamranjon · yesterday at 1:15 AM

You may not have seen this part: "Tied embeddings: We now tie the embeddings between the encoder and decoder. This significantly reduces the overall parameter count, allowing us to pack more active capabilities into the same memory footprint — crucial for our new compact 270M-270M model."
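
For a rough sense of what tying buys, here's a toy parameter count (the sizes are illustrative guesses, not numbers from the paper):

    import torch.nn as nn

    vocab_size, d_model = 256_000, 640  # illustrative sizes, not from the paper

    # One embedding table shared by encoder and decoder:
    shared = nn.Embedding(vocab_size, d_model)

    untied = 2 * vocab_size * d_model  # separate encoder + decoder tables
    tied = vocab_size * d_model        # single shared table
    print(f"untied: {untied:,} params")  # 327,680,000
    print(f"tied:   {tied:,} params")    # 163,840,000

At sub-billion scale the embedding tables dominate the parameter budget, so sharing one table (and often the output head as well) is presumably what lets a 270M-270M configuration fit in the same memory footprint.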
