Wuzado • today at 12:07 AM • 1 reply

I wonder: could this be used for a multi-tier MoE? E.g. active and most-used experts in VRAM, often-used ones in RAM, and less-used ones on NVMe?
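A minimal sketch of that tiering, assuming per-expert usage counts are available from a routing trace. The function name `assign_tiers`, the slot counts, and the trace are all hypothetical, and a real system would also need to move the weights, which this does not attempt:

```python
from collections import Counter

def assign_tiers(usage, vram_slots, ram_slots):
    """Map expert id -> tier name, placing the most-used experts first.

    usage: mapping of expert id -> how often the router picked it.
    Ties are broken by expert id so the result is deterministic.
    """
    ranked = sorted(usage, key=lambda e: (-usage[e], e))
    tiers = {}
    for rank, expert in enumerate(ranked):
        if rank < vram_slots:
            tiers[expert] = "vram"
        elif rank < vram_slots + ram_slots:
            tiers[expert] = "ram"
        else:
            tiers[expert] = "nvme"
    return tiers

# Hypothetical routing trace: the expert id chosen for each token.
trace = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 0, 1, 4, 5]
usage = Counter(trace)
tiers = assign_tiers(usage, vram_slots=2, ram_slots=2)
```

Experts that never appear in the trace get no entry here; defaulting them to the slowest tier would be one reasonable choice.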

Replies

rao-v • today at 12:11 AM

Yeah, I’ve often wondered why folks aren’t training two-tier MoEs for VRAM + RAM. We already have designs for shared experts, so it can’t be hard to implement a router that allocates to “core” experts 10x or 100x as often as to “nice to have” experts. I suppose balancing during training is tricky, but some sort of custom loss on the router layers should work.
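One way such a custom loss might look, as a sketch only: penalize the KL divergence between a skewed target load and the router's mean routing probabilities. The names `target_load` and `tier_balance_loss` and the 10x ratio are assumptions for illustration, and in practice this would be written in an autodiff framework rather than plain Python:

```python
import math

def target_load(num_core, num_extra, ratio=10.0):
    """Target routing fractions: each core expert gets `ratio` times
    the share of each "nice to have" expert. Core experts come first."""
    weights = [ratio] * num_core + [1.0] * num_extra
    total = sum(weights)
    return [w / total for w in weights]

def tier_balance_loss(router_probs, target, eps=1e-9):
    """KL(target || mean routing probability) over a batch of tokens.

    router_probs: one softmax distribution over experts per token.
    """
    n_tokens = len(router_probs)
    n_experts = len(target)
    mean = [sum(p[e] for p in router_probs) / n_tokens for e in range(n_experts)]
    return sum(t * math.log(t / (m + eps)) for t, m in zip(target, mean) if t > 0)

target = target_load(num_core=2, num_extra=2, ratio=10.0)
uniform_batch = [[0.25, 0.25, 0.25, 0.25]] * 8   # router ignores the tiers
skewed_batch = [target] * 8                      # router matches the target
loss_uniform = tier_balance_loss(uniform_batch, target)
loss_skewed = tier_balance_loss(skewed_batch, target)
```

The loss is near zero when the average routing matches the target split and grows as the router drifts toward uniform load, which is the opposite of the usual load-balancing term.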

I’ve also wondered why routers aren’t trained to be serially consistent, so you can predict which experts to swap into VRAM a few layers ahead and make the most of the available bandwidth.
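A rough way to measure how much lookahead prefetching would help, under a strong assumption: that "serially consistent" means the expert ids chosen at layer l tend to be chosen again at layer l + k. The function `prefetch_hit_rate` and the per-layer selections are hypothetical, and a trained predictor could replace the naive "reuse the same ids" guess:

```python
def prefetch_hit_rate(selected, lookahead):
    """Fraction of expert loads that a naive prefetcher would get right.

    selected[l] is the set of expert ids the router used at layer l
    for one token. The naive predictor assumes layer l + lookahead
    will reuse the ids seen at layer l.
    """
    hits = 0
    total = 0
    for layer in range(len(selected) - lookahead):
        predicted = selected[layer]
        actual = selected[layer + lookahead]
        hits += len(predicted & actual)
        total += len(actual)
    return hits / total if total else 0.0

# Hypothetical top-2 selections across four layers for a single token.
selected = [{0, 1}, {0, 1}, {0, 2}, {3, 2}]
rate_two_ahead = prefetch_hit_rate(selected, lookahead=2)
rate_one_ahead = prefetch_hit_rate(selected, lookahead=1)
```

A router trained for serial consistency would push this rate toward 1.0 at larger lookaheads, leaving more time to overlap the transfer with compute.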
