
antonvs · today at 3:08 AM

Almost certainly not. You can go on AWS or GCP and spin up a VM with 2.2 TB of RAM and 288 vCPUs. Worst case, if streaming the data sequentially isn't fast enough, you can use something like GNU Parallel to fan the work out across all 288 vCPUs. (It's also extremely easy to set up - 'apt install parallel' is about all you need.) That starts to resemble Hadoop, if you squint, except that it's all running on the same machine, with none of the network shuffle or serialization overhead, so it's going to outperform Hadoop significantly.
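
As a rough sketch of that GNU Parallel approach - assuming a newline-delimited input file and a hypothetical per-chunk program called ./process_chunk that reads stdin and writes stdout:

    # Install GNU Parallel (Debian/Ubuntu).
    sudo apt install parallel

    # Split the input file into ~1 GB chunks and run one process_chunk per chunk,
    # one job per core by default, concatenating the outputs in order.
    parallel --pipepart -a big_dataset.ndjson --block 1G ./process_chunk > results.out

The file and program names here are placeholders; the point is just that --pipepart hands each core its own slice of the file with almost no setup.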

The only reason not to do that is if the workload won't support that kind of out-of-the-box parallelism. But in that case, you'd be writing custom code for Hadoop or Spark anyway, so there's an argument for writing that same code to run on a single VM instead. These days it's pretty easy to essentially vibe code a custom script to do what you need.
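
For a hypothetical illustration of what that single-VM "custom script" can look like, here's a map/reduce-style pipeline that counts events per user in a large TSV (the file name and field layout are made up):

    # "Map": each core runs awk over a ~2 GB chunk and emits partial per-user counts.
    # "Reduce": a final awk merges the partial counts into one result.
    parallel --pipepart -a events.tsv --block 2G \
        "awk -F'\t' '{ c[\$1]++ } END { for (k in c) print k, c[k] }'" \
      | awk '{ c[$1] += $2 } END { for (k in c) print k, c[k] }' > user_counts.txt

It's the same shape as a Hadoop job, just without a cluster to stand up.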

At the company I'm with, we use Spark and Apache Beam for many of our large workloads, but those typically involve data at the petabyte scale. If you're just dealing with a few dozen terabytes, it's often faster and easier to spin up a single large VM. I ran a job like that just last Friday, on a 96-core VM with 350 GB of RAM.