Hacker News

Setting Up a Cluster of Tiny PCs for Parallel Computing

51 points, by speckx, yesterday at 7:08 PM, 28 comments

Comments

dent9, today at 1:04 AM

I appreciate the author's work in doing this and writing it all up so nicely. However, every time I see someone doing this, I cannot help but wonder why they are not just using SLURM + Nextflow. SLURM can easily cluster the separate computers as worker nodes, and Nextflow can orchestrate the submission of batch jobs to SLURM as a managed pipeline of tasks. The individual tasks submitted to SLURM would be the user's own R scripts (or any script they have). Combine this with Docker containers on the nodes to manage the dependencies needed for task execution, and possibly Ansible to manage the nodes themselves (installing the SLURM daemons and packages, etc.). Taken together, this creates a FAR more portable, system-agnostic, and language-agnostic data analysis workflow that can seamlessly scale over as many nodes and data sets as you can shove into it.

This is a LOT better than writing code in R itself to handle the communication and data passing between nodes directly. It's not clear to me that the author actually needs anything like that, and what's worse, I have seen other authors write exactly that in R and end up reinventing the wheel of implementing parallel compute tasks. It's really not that complicated: 1) write an R script that takes a chunk of your data as input, processes it, and writes output to some file; 2) use a workflow manager to pass chunks of the data to discrete parallel task instances of your script and submit the tasks as jobs to 3) a hardware-agnostic job scheduler running on your local hardware and/or cloud resources. This is basically the backbone of HPC, but it seems like a lot of people "forget" about the 'job scheduler' and 'workflow manager' parts and jump straight to gluing data-analysis code to hardware.
Also important to note that almost all robust workflow managers such as Nextflow already include the pieces like "report task completion", "collect task success/failure logs", and "report task CPU/memory resource usage", so that you, the end user, only need to write the parts that implement your data analysis.
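The chunk / schedule / collect pattern described above can be sketched in plain shell, with background jobs standing in for `sbatch` submissions (or Nextflow tasks); all file names here are made up for the example, and the "analysis" is just a sum:

```shell
#!/bin/sh
# Toy sketch of the chunk -> parallel tasks -> collect pattern.
# In a real setup, each background job would instead be an `sbatch`
# submission (or a Nextflow process) running the user's own R script.
set -e
workdir=$(mktemp -d)
seq 1 100 > "$workdir/input.txt"

# 1) split the input into chunks, one per task
split -l 25 "$workdir/input.txt" "$workdir/chunk_"

# 2) run one task per chunk; each task sums its chunk independently
for chunk in "$workdir"/chunk_*; do
  awk '{ s += $1 } END { print s }' "$chunk" > "$chunk.out" &
done
wait  # a real scheduler would track job completion for us

# 3) collect the per-chunk results into a final answer
total=$(cat "$workdir"/chunk_*.out | awk '{ s += $1 } END { print s }')
echo "$total"
rm -r "$workdir"
```

Swapping the background jobs for scheduler submissions changes nothing about the data flow, which is the point: the analysis script only ever sees one chunk.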

mianos, today at 2:38 AM

If you want to do the same thing (build a cluster of complete machines) in a lightweight manner, just for the learning experience, you can use Incus to create the nodes as containers. Since they are all complete machines and you bridge them, each gets a DHCP address just like a little PC on the local LAN. If you have SSH set up in the image, as described, you can run the identical scripts.

As a plus, if you run them on ZFS with deduplication, even the disk cost of new machines is minuscule.
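For anyone trying this, the Incus flow might look roughly like the following; it's a CLI sketch that needs a running Incus daemon, and the image alias and container names are just illustrative:

```shell
# Sketch of the Incus approach; each container behaves like a
# full machine on the bridged LAN (names here are illustrative).
incus launch images:debian/12 node1
incus launch images:debian/12 node2

# The containers pick up DHCP addresses, visible here:
incus list

# Run commands inside a node, e.g. to check that sshd came up:
incus exec node1 -- systemctl status ssh
```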

sparcpile, yesterday at 10:56 PM

So someone re-implemented the Beowulf cluster. I guess the other Slashdot memes are ready to come back.

jmward01, today at 1:08 AM

Something like Proxmox [1] could make wiping everything and restarting a lot easier. There really isn't a huge penalty between bare metal and a VM now, so you get the ability to create/deploy/monitor everything from a reasonable interface. If the standard clustering features aren't enough, their datacenter version looks like it is designed for exactly this. I've never used it, though. (No ties to Proxmox here, just a guy who likes it!)

[1] https://www.proxmox.com/en/products/proxmox-virtual-environm...

dapperdrake, yesterday at 11:52 PM

Congratulations on learning about distributed electronic computers. (This is worth tinkering with. This is how people actually get good at HPC.)

Pay attention to your SRAM (L3 unified cache), DRAM, and swap-space tilings.

[Snark] In practice: with memory-access latency depending on both the square root of the memory size and the physical length of the wires in your cluster, this sounds like a case for Adam Drake:

https://adamdrake.com/command-line-tools-can-be-235x-faster-...

MarsIronPI, today at 12:21 AM

I'm planning to set up something similar (but for compiling code). The difference is that my systems came without storage, so I intend to netboot them, which adds a whole other level of complication. I'm planning to use NixOS, with something like nixos-netboot-serve[0].

[0]: https://github.com/DeterminateSystems/nix-netboot-serve

kingaillas, today at 4:08 AM

If the author had googled better, they might have discovered https://www.learnpdc.org/

yjftsjthsd-h, today at 12:29 AM

Something I've been playing with: what's the cheapest semi-reasonable computer I could make a toy cluster from? Currently eyeing a bunch of Raspberry Pi Zeros, but I suspect there are cheaper options (maybe some tiny OpenWrt thing). Too bad ESP32s don't have an MMU :D

dapperdrake, yesterday at 11:55 PM

Kudos for Rscript.