logoalt Hacker News

daemonologistyesterday at 5:32 PM1 replyview on HN

There are also significant economies of scale (namely: utilization and batching), which tend to make inference on a shared server more economical even after the operator takes a cut.


Replies

zozbot234yesterday at 7:28 PM

You can use batching on consumer hardware, it just requires a KV-cache efficient model (or short context only) and keeping multiple inference flows running in parallel. This is most useful in combination with streamed inference, since the compute intensity of decode with those newer KV-compressed models is high enough that you have limited compute headroom when running at the speed of RAM.