Hacker News

ericd, today at 1:29 AM

How're you fitting a model made for 80 gig cards onto a GPU with 24 gigs at full quant?


Replies

Havoc, today at 9:44 AM

He said quad 3090 not single

zozbot234, today at 1:37 AM

Offloading the MoE layers to CPU inference is the easiest way, though it's a bit of a drag on performance.
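
A rough back-of-the-envelope sketch of that trade-off, in case it helps. Everything here is an assumption for illustration: the function name `plan_offload`, the parameter count, the share of weights that sit in expert layers, and the per-GPU reserve for KV cache and activations are all made up, not taken from any specific model.

```python
def plan_offload(n_params, bytes_per_param, expert_fraction,
                 n_gpus, vram_per_gpu_gib, reserve_per_gpu_gib):
    """Estimate whether weights fit in VRAM, with and without expert offload.

    All inputs are hypothetical; this only counts weight memory and a flat
    per-GPU reserve, and ignores framework overhead and fragmentation.
    """
    gib = 2 ** 30
    total_gib = n_params * bytes_per_param / gib
    expert_gib = total_gib * expert_fraction   # MoE expert weights
    dense_gib = total_gib - expert_gib         # attention, embeddings, etc.
    budget_gib = n_gpus * (vram_per_gpu_gib - reserve_per_gpu_gib)
    return {
        "total_gib": total_gib,
        "gpu_budget_gib": budget_gib,
        "fits_on_gpu": total_gib <= budget_gib,
        # With every expert layer moved to system RAM, only the dense part
        # has to live on the GPUs.
        "gpu_gib_with_offload": dense_gib,
        "cpu_gib_with_offload": expert_gib,
        "fits_with_expert_offload": dense_gib <= budget_gib,
    }


if __name__ == "__main__":
    # Made-up example: 100B params at 2 bytes each (unquantized 16-bit),
    # 90% of weights in experts, four 24 GiB cards, 4 GiB reserved per card.
    plan = plan_offload(100e9, 2, 0.9, 4, 24, 4)
    for key, value in plan.items():
        print(f"{key}: {value}")
```

The speed hit comes from the expert weights being read out of system RAM instead of VRAM on every token, so the estimate only says whether it fits, not how fast it runs.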
