
killerstorm · yesterday at 2:09 AM

I have seen this part. In fact, I checked the paper itself, where they provide more detailed numbers: it's still almost double the size of the base Gemma. Reusing the embeddings and attention weights doesn't make that much difference, since most of the weights are in the MLPs.
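
For intuition, here's a rough back-of-envelope sketch with made-up, Gemma-like numbers (hidden size, layer count, etc. are illustrative assumptions, not the paper's actual figures). It shows why a second stack that shares embeddings and attention, but not MLPs, still lands close to 2x the base:

    # Illustrative Gemma-like config (assumed, not from the paper)
    d_model = 2048        # hidden size
    n_layers = 18         # transformer blocks
    vocab = 256_000       # Gemma-style large vocabulary
    d_ff = 8 * d_model    # MLP intermediate size

    embed = vocab * d_model                  # token embeddings (tied in/out)
    attn_per_layer = 4 * d_model * d_model   # Q, K, V, O projections
                                             # (ignoring multi-query details)
    mlp_per_layer = 3 * d_model * d_ff       # gated MLP: gate, up, down

    base = embed + n_layers * (attn_per_layer + mlp_per_layer)
    # Second stack reusing embeddings and attention, with its own MLPs:
    shared = embed + n_layers * (attn_per_layer + 2 * mlp_per_layer)

    print(f"base:   {base / 1e9:.2f}B params")
    print(f"shared: {shared / 1e9:.2f}B params ({shared / base:.2f}x base)")

With these numbers the MLPs account for roughly 1.8B of the ~2.6B base parameters, so duplicating only the MLPs already pushes the total to about 1.7x the base.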