Hacker News

coder543, yesterday at 10:55 PM

Reducing the expert count after training causes catastrophic loss of knowledge and skills. Cerebras does this with their REAP models (although there it is applied to the total set of experts, not just to routing each token to fewer experts). It can be okay for very specific use cases if you measure which experts your use case needs and carefully choose to delete the least-used ones, but it doesn't really provide any general insight into how a higher-sparsity model would behave if trained that way from scratch.
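
To make the "measure which experts are needed, then delete the least used ones" step concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the router scores are synthetic, the function names (`top_k_experts`, `count_expert_usage`, `experts_to_keep`) are hypothetical, and ranking by raw selection count is a simplification, not a description of how REAP actually scores experts.

```python
import random
from collections import Counter

def top_k_experts(router_scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    order = sorted(range(len(router_scores)),
                   key=lambda e: router_scores[e], reverse=True)
    return order[:k]

def count_expert_usage(token_scores, k):
    """Count how often each expert is selected over a calibration set."""
    usage = Counter()
    for scores in token_scores:
        usage.update(top_k_experts(scores, k))
    return usage

def experts_to_keep(usage, num_experts, num_keep):
    """Keep the most-selected experts; the rest are candidates for deletion."""
    ranked = sorted(range(num_experts), key=lambda e: usage[e], reverse=True)
    return sorted(ranked[:num_keep])

# Hypothetical calibration data: 1000 tokens, 8 experts, top-2 routing.
# Experts 0-3 get a higher mean score to mimic a narrow use case that
# mostly relies on half of the experts (an assumption for illustration).
rng = random.Random(0)
NUM_EXPERTS, TOP_K = 8, 2
token_scores = [
    [rng.gauss(1.0 if e < 4 else 0.0, 0.5) for e in range(NUM_EXPERTS)]
    for _ in range(1000)
]

usage = count_expert_usage(token_scores, TOP_K)
kept = experts_to_keep(usage, NUM_EXPERTS, num_keep=4)
print("selections per expert:", dict(sorted(usage.items())))
print("experts kept:", kept)
```

The counts only reflect the calibration data they were measured on, which is why this kind of pruning can hold up for one narrow workload and still say nothing about a model trained with fewer experts from scratch.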