How are you fitting a model made for 80 GB cards onto a GPU with 24 GB at full quant?

ericd • today at 1:29 AM • 2 replies

Havoc • today at 9:44 AM

He said quad 3090s, not a single card.
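
A quick back-of-the-envelope check of that point, as a rough sketch: 24 GB is the per-card figure from the question, 80 GB is the target card size from the question, and the helper name `total_vram_gb` is made up for illustration. It ignores per-GPU overhead, KV cache, and activations, so real headroom will be smaller.

```python
def total_vram_gb(num_gpus: int, vram_per_gpu_gb: float) -> float:
    """Naive pooled VRAM: assumes the model shards evenly across cards."""
    return num_gpus * vram_per_gpu_gb


quad_3090 = total_vram_gb(4, 24)   # four 24 GB cards
single_big_card = 80               # the "80 GB card" from the question

print(f"pooled VRAM: {quad_3090} GB, "
      f"covers an 80 GB card: {quad_3090 >= single_big_card}")
```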

zozbot234 • today at 1:37 AM

Offloading the MoE layers to CPU inference is the easiest way, though it's a bit of a drag on performance.
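
To make the trade-off concrete, here is a minimal sketch of the memory split when the expert weights sit in system RAM and everything else stays on the GPU. All numbers are hypothetical placeholders, not figures for any particular model, and `split_moe_memory` is an illustrative helper, not part of any library.

```python
def split_moe_memory(total_params_b: float,
                     expert_fraction: float,
                     bytes_per_param: float) -> tuple[float, float]:
    """Return (gpu_gb, cpu_gb) if all expert weights are kept in system RAM.

    total_params_b   -- total parameters, in billions (hypothetical)
    expert_fraction  -- share of parameters living in MoE expert layers
    bytes_per_param  -- e.g. 2 for 16-bit weights
    """
    total_gb = total_params_b * bytes_per_param  # 1e9 params * bytes = GB
    cpu_gb = total_gb * expert_fraction          # experts offloaded to CPU
    gpu_gb = total_gb - cpu_gb                   # attention, shared layers, etc.
    return gpu_gb, cpu_gb


# Hypothetical model: 100B params, 90% of them in experts, 16-bit weights.
gpu_gb, cpu_gb = split_moe_memory(100, 0.9, 2)
print(f"GPU: {gpu_gb:.0f} GB, CPU RAM: {cpu_gb:.0f} GB")
```

The GPU side shrinks a lot because experts hold most of the parameters, but each token's active experts then run on the CPU, which is where the performance drag comes from.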
