Hacker News

RobinL 10/12/2024

By way of a single example, we've recently been migrating from Spark to DuckDB. Our jobs are not huge, but they're too big for a single 'normal' machine. We've gone from a 2.5 hour runtime on a cluster of 10 machines (40 vCPUs total) to a 15 minute runtime on a single 32 vCPU machine. I don't know for sure, but I think this is largely because it eliminates expensive shuffles and serde. Obviously results vary hugely depending on workload, and some jobs are simply too big even for a 192-core machine. But I suspect a high proportion of workloads would be better run on single large machines nowadays.
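As a sketch of what the single-machine side looks like (paths and column names here are hypothetical, not our actual job):

    # Single-process DuckDB job: no cluster, no network shuffle, no
    # cross-machine serde. DuckDB parallelises across local cores.
    import duckdb

    con = duckdb.connect()              # in-memory database
    con.execute("SET threads TO 32;")   # match the 32 vCPUs
    df = con.execute("""
        SELECT customer_id, count(*) AS n, sum(amount) AS total
        FROM read_parquet('data/events/*.parquet')
        GROUP BY customer_id
    """).df()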


Replies

menaerus 10/14/2024

A cluster of 10 machines with 40 vCPUs in total would equate to 4 vCPUs per machine. I am not familiar with Spark internals, but in the realm of distributed databases such a setup would generally make no sense at all (to me). So I think you're correct that most of the overhead was caused by machine-to-machine byte juggling. 4 vCPUs is nothing.

I suspect you could cut the 2.5 hour runtime down dramatically even with Spark if you just deployed it as a single instance on that same 32 vCPU machine.
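Something along these lines (the memory sizing is illustrative, not a recommendation):

    # Spark in local mode: 32 worker threads in one JVM, so shuffles
    # stay in-process instead of going over the network.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[32]")                    # one process, 32 threads
        .config("spark.driver.memory", "200g")  # illustrative sizing
        .appName("single-node-job")
        .getOrCreate()
    )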

conjecTech 10/13/2024

You're measuring wall time, not CPU time. They may be similar, but I'd suspect you weren't loading the worker nodes well. If the savings really come from reduced shuffles and serde, that's probably something you can measure. I'd be curious to see the findings.
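For a single process, the crude version of that measurement looks like this (run_job is a hypothetical stand-in for the workload; measuring a cluster would instead need Spark's own task metrics, e.g. shuffle read/write in the Spark UI):

    # Compare wall time against CPU time: a large gap means cores sat
    # idle, e.g. waiting on network shuffles or serde.
    import time

    wall0 = time.perf_counter()
    cpu0 = time.process_time()

    run_job()  # hypothetical: the workload under test

    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    print(f"wall {wall:.1f}s, cpu {cpu:.1f}s, utilisation {cpu / wall:.2f}")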

I'm not against using simple methods where appropriate. 95% of the companies out there probably do not need frameworks like Spark. I think the main argument against them is operational complexity, though, not the compute overhead.