> The first thing to do is get a lot of game data. This proved more difficult than I thought it would be, but after some looking around online I found a git repository on GitHub from rozim that had plenty of games. I used this to compile a set of 3.46GB of data, which is about twice what Tom used in his test. The next step is to get all that data into our pipeline.
It would be interesting to redo the benchmark but with a (much) larger database.
Nowadays the biggest open chess dataset must come from Lichess (https://database.lichess.org), with ~7B games at 2.34 TB compressed, ~14 TB uncompressed.
Would Hadoop win here?
If you get all the data on fast SSDs in a single chassis, you probably still beat EMR over S3. But then you have a whole dedicated server to manage your 14 TB of chess games.
The "EMR over S3" paradigm is based on the assumption that the data isn't read all that frequently, 1-10x a day typically, so you want your cheap S3 storage but once in a while you'll want to crank up the parallelism to run a big report over longer time periods.
> Would Hadoop win here?
Hadoop never wins. It's the worst of all possible worlds.
Probably not.
The compressed data can fit onto a local SSD. Decompression can definitely be streamed.
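To make "streamed" concrete: the Lichess dumps are zstd-compressed PGN, so a single pass can read straight off the SSD without ever materializing the ~14 TB of uncompressed text. A minimal sketch in Python, assuming the `zstandard` package is installed and using an illustrative local filename:

```python
# Sketch: stream-decompress one Lichess monthly dump and count its games
# without writing the uncompressed PGN anywhere.
# Assumes `pip install zstandard`; the filename is a placeholder.
import io
import zstandard

path = "lichess_db_standard_rated_2024-01.pgn.zst"

games = 0
with open(path, "rb") as fh:
    text = io.TextIOWrapper(
        zstandard.ZstdDecompressor().stream_reader(fh), encoding="utf-8"
    )
    for line in text:
        # Every game in a PGN dump begins with an [Event "..."] tag.
        if line.startswith("[Event "):
            games += 1

print(f"{games} games")
```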
Almost certainly not. You can go on AWS or GCP and spin up a VM with 2.2 TB RAM and 288 vCPUs. Worst case, if streaming the data sequentially isn't fast enough, you can use something like GNU Parallel to launch processes in parallel across all 288 CPUs. (It's also extremely easy to set up - 'apt install parallel' is about all you need.) That starts to resemble Hadoop, if you squint, except that it's all running on the same machine. As a result, it's going to outperform Hadoop significantly.
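A rough Python sketch of that single-machine fan-out, standing in for the GNU Parallel invocation (the glob and the per-file work are placeholders; Lichess publishes one dump file per month):

```python
# Sketch: keep all the cores busy by mapping a per-file job over the
# monthly dumps, which is roughly what `parallel` does from the shell.
import glob
import io
import multiprocessing
import zstandard

def count_games(path: str) -> int:
    """Stream-decompress one monthly dump and count the games in it."""
    games = 0
    with open(path, "rb") as fh:
        text = io.TextIOWrapper(
            zstandard.ZstdDecompressor().stream_reader(fh), encoding="utf-8"
        )
        for line in text:
            if line.startswith("[Event "):
                games += 1
    return games

if __name__ == "__main__":
    files = sorted(glob.glob("lichess_db_standard_rated_*.pgn.zst"))
    # Pool defaults to one worker per CPU, so all 288 vCPUs get used.
    with multiprocessing.Pool() as pool:
        print(sum(pool.map(count_games, files)))
```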
The only reason not to do that is if for some reason the workload won't support that kind of out-of-the-box parallelism. But in that case, you'd be writing custom code for Hadoop or Spark anyway, so there's an argument for doing the same to run on a single VM. These days it's pretty easy to essentially vibe code a custom script to do what you need.
At the company I'm with, we use Spark and Apache Beam for many of our large workloads, but those typically involve data at the petabyte scale. If you're just dealing with a few dozen terabytes, it's often faster and easier to spin up a single large VM. I just ran a process like that on Friday, on a 96-core VM with 350 GB RAM.
What would you calculate from the data?
I could be tempted to do some fun on an NVL72 ;-)
It depends on what you were trying to do with the data. Hadoop would never win, but Spark can hold all of that data in memory across multiple machines and let you perform various operations on it.
If all you wanted to do was filter the dataset for certain fields, you could likely do something faster programmatically on a single machine.
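For contrast, a minimal sketch of that kind of field filter in PySpark, assuming the dumps have been decompressed to plain PGN somewhere Spark can read; the path, the regex, and the Elo cutoff are all placeholders:

```python
# Sketch: the same "filter on a certain field" job expressed in PySpark,
# so the data can be spread across the memory of a whole cluster.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("chess-filter").getOrCreate()

# Each PGN line becomes a row with a single `value` column.
lines = spark.read.text("/data/lichess/*.pgn")

# Keep the WhiteElo header lines and pull the rating out of them.
white_elo = (
    lines.filter(F.col("value").startswith('[WhiteElo "'))
    .select(
        F.regexp_extract("value", r'\[WhiteElo "(\d+)"\]', 1)
        .cast("int")
        .alias("white_elo")
    )
)

# Example question: how many games had a 2500+ rated player with white?
print(white_elo.filter(F.col("white_elo") >= 2500).count())
```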