
alex_young, yesterday at 7:44 PM

Clusters are almost never the right answer for most problems: https://yourdatafitsinram.net/


Replies

dantillberg, yesterday at 8:05 PM

Most data problems don't need to fit in RAM.

antonvs, yesterday at 8:28 PM

You're drawing an incorrect conclusion from that site. Aside from the fact that "fitting in RAM" is not the only criterion for needing a cluster, the fact that it's possible to fit data into RAM on a single machine doesn't mean that's the most cost-effective, practical, or sensible solution.

A big advantage of clusters, and horizontal scaling in general, is the ability to scale dynamically and easily to meet demand.

If you're running a system on a single machine that has N GB of memory and you need to scale to N+1, what do you do? Provision a new machine and migrate everything over?

No one operates online real-time systems like this. Clusters make it much easier and less expensive to handle growth in demand.

On top of that, it's probably true that in some pure numerical problem-count sense, "most problems" don't need a cluster, but that's misleading. It's like saying "most businesses are mom-and-pop shops." Perhaps true, but it ignores hundreds of thousands of larger businesses, or even small businesses that have big data needs.

There are plenty of problems that involve large amounts of data, and that's increasingly true with ML applications.

I'm at a company of ~100 people which you've probably never heard of (classified as a "small" company in government stats, so not included in the hundreds of thousands figure I mentioned above.) We have 1.9 PB of data for our main environment. When we run processes that deal with it all, the clusters scale to thousands of vCPUs and tens of terabytes of RAM.

Several processes that run daily scale to 500+ vCPUs and many TB of RAM. For the latter, the data itself could probably fit in RAM on a humongous machine, but the CPUs wouldn't fit on a single machine. And we'd have to size the machines carefully every time we start them up. Clusters can scale up dynamically according to the demands of the jobs they're executing.
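
To make the sizing point concrete, here's a rough sketch of the arithmetic. The node shape (64 vCPUs, 512 GB RAM) and the per-job demand figures are hypothetical numbers picked for illustration, not our actual configuration; `nodes_needed` is just a made-up helper name.

```python
import math

def nodes_needed(vcpus, ram_gb, node_vcpus, node_ram_gb):
    """Smallest node count that covers both the CPU and the RAM demand."""
    by_cpu = math.ceil(vcpus / node_vcpus)
    by_ram = math.ceil(ram_gb / node_ram_gb)
    return max(by_cpu, by_ram)

# Hypothetical node shape (an assumption, not a real instance type).
NODE_VCPUS, NODE_RAM_GB = 64, 512

# Hypothetical job demands as (vCPUs, RAM in GB), loosely in the range described above.
jobs = {
    "daily job": (500, 4_000),
    "full run": (3_000, 30_000),
}

for name, (vcpus, ram_gb) in jobs.items():
    count = nodes_needed(vcpus, ram_gb, NODE_VCPUS, NODE_RAM_GB)
    print(f"{name}: {count} nodes")
```

With a cluster, that node count is recomputed per job by the autoscaler. With a single machine, you have to size for the largest job up front and pay for it the rest of the time.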
