ekianjo • yesterday at 2:56 PM • 1 reply

Yeah, the 27B feels like something completely different. If you use it on long-context tasks, it performs WAY better than 35b-a3b.

Replies

Der_Einzige • yesterday at 3:27 PM

I've been telling analysts/investors for a long time that dense architectures aren't "worse" than sparse MoEs, and that they should keep expecting releases to see-saw between those two sub-architectures. Glad to keep being vindicated on this one.

For those who don't believe me: go take a look at the logprobs of an MoE model and a dense model and let me know if you notice anything. Researchers sure did.
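If you want to try that comparison yourself, here is a rough sketch of one generic way to do it, assuming you have already dumped per-token top-k logprobs from each model into plain lists. The helper names (`topk_entropy`, `mean_entropy`) and the sample numbers are hypothetical, made up only so the script runs; they say nothing about any real model, and entropy is just one statistic you might pick to look at.

```python
import math

def topk_entropy(logprobs):
    """Shannon entropy (in nats) of one token position's top-k logprobs,
    renormalized so the k probabilities sum to 1."""
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)

def mean_entropy(positions):
    """Average top-k entropy across all token positions."""
    return sum(topk_entropy(p) for p in positions) / len(positions)

# Placeholder data (assumption): one inner list of top-k logprobs per token.
# Replace these with logprobs exported from the two models you are comparing.
model_a_positions = [[-0.1, -2.5, -4.0], [-0.7, -0.9, -3.0]]
model_b_positions = [[-0.05, -3.5, -5.0], [-0.3, -1.6, -3.2]]

print("model A mean top-k entropy:", mean_entropy(model_a_positions))
print("model B mean top-k entropy:", mean_entropy(model_b_positions))
```

Run both models on the same prompts with the same k, otherwise the numbers aren't comparable.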
