Hacker News

stogot last Saturday at 5:59 PM

Sounds amazing; what are the downsides that a company needs to consider? Memory bottlenecks or storage bus access?


Replies

necubi last Saturday at 7:11 PM

One downside is that you're paying for the GPU whether you're fully using it or not. It takes big queries to saturate a GH200, and if you're only using 10% of the capacity of the GPU it doesn't really matter that it's 10x faster.
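The arithmetic behind that point can be sketched as a back-of-envelope calculation. The prices below are made-up assumptions purely for illustration; only the shape of the argument (10x speedup, 10% utilization) comes from the comment above.

```python
# Back-of-envelope: is a GPU instance cost-effective at low utilization?
# All prices here are assumed, not real cloud pricing.
cpu_cost_per_hour = 1.0    # assumed hourly price of a CPU instance
gpu_cost_per_hour = 10.0   # assumed hourly price of a GPU instance
gpu_speedup = 10.0         # GPU is 10x faster *when fully saturated*
utilization = 0.10         # but your queries only use 10% of its capacity

# Effective speedup scales with how much of the GPU you can actually keep busy.
effective_speedup = gpu_speedup * utilization  # 10.0 * 0.10 = 1.0
cost_per_unit_work_ratio = gpu_cost_per_hour / (cpu_cost_per_hour * effective_speedup)
print(cost_per_unit_work_ratio)  # → 10.0: you pay 10x per unit of work
```

At full saturation the ratio drops to 1.0, i.e. the GPU's price premium is exactly offset by its speedup; anything below that and you're paying for idle silicon.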

In a typical company you'll have jobs, some scheduled, some ad-hoc, at a range of sizes. Most of them won't be cost-effective to run on a GPU instance, so you need a scheduling layer that estimates the size of each job and routes it to the appropriate hardware. But what if a job is too big to run on your GPU machines? Then you either have to scale up the GPU cluster or retry the job on your more flexible CPU cluster.
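The scheduling layer described above can be sketched as a simple size-based router. Everything here (the thresholds, the `Job` fields, the pool names) is an illustrative assumption, not something from the comment:

```python
# Hypothetical sketch of a size-based scheduling layer: estimate each job's
# input size and route it to a GPU or CPU executor pool. Thresholds and field
# names are assumptions for illustration only.
from dataclasses import dataclass

GPU_MIN_BYTES = 1 * 10**12    # below this, the GPU would sit mostly idle
GPU_MAX_BYTES = 100 * 10**12  # above this, the job won't fit the GPU cluster

@dataclass
class Job:
    name: str
    estimated_input_bytes: int

def route(job: Job) -> str:
    """Pick an executor pool based on the job's estimated input size."""
    if job.estimated_input_bytes < GPU_MIN_BYTES:
        return "cpu"  # too small to saturate a GPU; not cost-effective there
    if job.estimated_input_bytes > GPU_MAX_BYTES:
        return "cpu"  # too big for the GPU cluster; fall back to CPUs
    return "gpu"

print(route(Job("small-adhoc", 10 * 10**9)))    # → cpu
print(route(Job("daily-100tb", 100 * 10**12)))  # → gpu
print(route(Job("oversized", 500 * 10**12)))    # → cpu (the fallback case)
```

The last case is exactly the awkward one: a job that overflows the GPU tier silently lands back on the CPU cluster, which only works if it behaves identically on both.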

And this all assumes that your jobs can be transparently run across different executors from a correctness and performance standpoint.

There are niches where this makes sense (we run the same 100TB job every day and we need to speed it up), as well as large and sophisticated internal infra teams that can manage a heterogeneous cluster plus the scheduling systems, but it's not mass-market.
