No one runs SOTA models 24/7 for individual use, or even for a single household or small business, whereas you can keep your own hardware doing AI inference basically 24/7.
With the new DeepSeek V4 series and its uniquely memory-light KV cache, you can even extend this to parallel inference: serving several sequences per pass over the weights hides memory-bandwidth bottlenecks and increases compute intensity.
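To make "compute intensity" concrete, here's a minimal roofline-style sketch. Every constant in it is an invented, illustrative assumption (active parameter count, quantization width, bandwidth, FLOPS), not a measured DS4 figure; the point is just that in decode, one pass over the active weights serves the whole batch, so aggregate throughput scales with batch size until it hits the compute roof.

```python
# Toy roofline for batched decode. All constants below are assumed,
# illustrative numbers, not measured DeepSeek V4 or hardware specs.
ACTIVE_PARAMS = 37e9      # assumed active params per token (MoE)
BYTES_PER_PARAM = 0.55    # assumed ~4.4 bits/param after quantization
PEAK_BW = 250e9           # assumed memory bandwidth, bytes/s
PEAK_FLOPS = 60e12        # assumed sustainable FLOP/s

def decode_toks_per_sec(batch: int) -> float:
    """One weight pass per decode step is shared by the whole batch,
    so the bandwidth-bound rate scales with batch size until the
    compute-bound ceiling (~2 FLOPs per param per token) kicks in."""
    weight_bytes = ACTIVE_PARAMS * BYTES_PER_PARAM
    bw_bound = batch * PEAK_BW / weight_bytes
    compute_bound = PEAK_FLOPS / (2 * ACTIVE_PARAMS)
    return min(bw_bound, compute_bound)

for b in (1, 2, 4, 8, 16):
    print(f"batch {b:2d}: ~{decode_toks_per_sec(b):6.1f} tok/s aggregate")
```

One caveat: with MoE routing, a larger batch activates more distinct experts per step, so the weight pass isn't perfectly amortized, and the KV cache adds its own traffic (small for DS4, per the above, but not zero). Read the numbers as an optimistic upper bound.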
This is perhaps not so useful on a 96GB or 128GB RAM Apple Silicon device: I've seen recent reports of DS4 runs hitting serious thermal and power limits on those machines with even one agent flow, so increasing compute intensity probably won't help there. But it will become useful on 64GB-and-under devices that have to stream weights from a slow disk, or on boxes like the DGX Spark, and to a lesser extent Strix Halo, that greatly overprovision compute while being bottlenecked on memory bandwidth.
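The same roofline logic shows where each box sits. The bandwidth figures below are roughly the published specs, but the sustainable-FLOPS figures are hedged guesses on my part; still, they illustrate the claim: the batch size at which decode stops being bandwidth-bound is far higher on the Spark than on a big Apple Silicon part.

```python
# Crossover batch size: the batch at which batched decode stops being
# memory-bandwidth-bound and hits the compute roof. Device FLOPS are
# rough personal assumptions, not official sustained numbers.
BYTES_PER_PARAM = 0.55  # assumed ~4.4 bits/param after quantization

DEVICES = {               # (bandwidth bytes/s, assumed usable FLOP/s)
    "DGX Spark":  (273e9, 100e12),
    "Strix Halo": (256e9,  50e12),
    "M3 Ultra":   (819e9,  28e12),
}

for name, (bw, flops) in DEVICES.items():
    # bw-bound rate/seq is bw / (P * bpp); compute roof is flops / (2P).
    # Equating them cancels P: crossover = flops * bpp / (2 * bw).
    crossover = flops * BYTES_PER_PARAM / (2 * bw)
    print(f"{name:10s}: compute-bound past batch ~{crossover:5.1f}")
```

On these assumed specs the Spark has room for a batch of ~100 before compute saturates, while the big Apple Silicon part runs out of compute headroom around batch 10, and it's thermally constrained well before that anyway.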
Please do let me know what that's actually useful for, other than spawning your next AI girlfriend to role-play with.