Hacker News

MillionOClock, yesterday at 7:25 PM

I hope some company trains its models so that expert switches are needed less often, specifically for these use cases.


Replies

zozbot234, yesterday at 7:33 PM

A model "where expert switches are less necessary" is hard to tell apart from a model that simply has fewer total experts, so I'm not sure that would be a good approach. "How often to switch" also depends on how much spare RAM the system has available to keep layers opportunistically cached from the previous token(s). There's no one-size-fits-all decision.
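
To make the RAM dependence concrete, here is a rough toy simulation rather than a measurement of any real model. It assumes experts are cached whole in an LRU cache and that routing follows a made-up "stickiness" probability of reusing the previous token's experts. The names simulate_expert_cache, make_routes, capacity and stickiness are all hypothetical and chosen only for illustration.

```python
from collections import OrderedDict
import random


def simulate_expert_cache(routes, capacity):
    """Return the fraction of expert activations that miss an LRU cache.

    routes:   list of per-token lists of expert ids (a hypothetical routing trace)
    capacity: number of whole experts that fit in spare RAM (an assumption;
              real systems cache at other granularities too)
    """
    cache = OrderedDict()  # expert id -> True, ordered oldest to newest use
    misses = 0
    total = 0
    for experts in routes:
        for expert in experts:
            total += 1
            if expert in cache:
                cache.move_to_end(expert)      # mark as most recently used
            else:
                misses += 1                    # expert must be loaded ("switch")
                cache[expert] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return misses / total if total else 0.0


def make_routes(num_tokens, num_experts, top_k, stickiness, seed=0):
    """Generate a synthetic routing trace (requires num_experts > top_k).

    Each of the top_k slots keeps its expert for the next token with
    probability `stickiness`; otherwise it moves to an unselected expert.
    """
    rng = random.Random(seed)
    current = rng.sample(range(num_experts), top_k)
    routes = []
    for _ in range(num_tokens):
        routes.append(list(current))
        for slot in range(top_k):
            if rng.random() >= stickiness:
                choices = [e for e in range(num_experts) if e not in current]
                current[slot] = rng.choice(choices)
    return routes


if __name__ == "__main__":
    trace = make_routes(num_tokens=2000, num_experts=64, top_k=4, stickiness=0.5)
    for capacity in (4, 8, 16, 32, 64):
        rate = simulate_expert_cache(trace, capacity)
        print(f"capacity={capacity:3d} experts  miss rate={rate:.3f}")
```

In this toy setup the same routing trace produces very different miss rates depending on capacity alone. Lowering num_experts and raising stickiness both reduce misses as well, which is one way to see why the two kinds of model are hard to tell apart from the outside.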