dent9 • today at 1:04 AM

I appreciate the author's work in doing this and writing it all up so nicely. However, every time I see someone doing this, I cannot help but wonder why they are not just using SLURM + Nextflow. SLURM can easily cluster the separate computers as worker nodes, and Nextflow can orchestrate the submission of batch jobs to SLURM in a managed pipeline of tasks. The individual tasks to submit to SLURM would be the user's own R scripts (or any script they have). Combine this with Docker containers on the nodes to manage the dependencies needed for task execution, and possibly Ansible to manage the nodes themselves (installing the SLURM daemons, packages, etc.).

Taken together, this creates a FAR more portable, system-agnostic, and language-agnostic data analysis workflow that can seamlessly scale over as many nodes and data sets as you can shove into it. This is a LOT better than trying to write all this code in R itself to do the communication and data passing between nodes directly. It's not clear to me that the author actually needs anything like that, and what's worse, I have seen other authors write exactly that in R and end up re-inventing the wheel of implementing parallel compute tasks (in R).

It's really not that complicated: 1) write an R script that takes a chunk of your data as input, processes it, and writes output to some file, 2) use a workflow manager to pass chunks of the data to discrete parallel task instances of your script / program and submit the tasks as jobs to 3) a hardware-agnostic job scheduler running on your local hardware and/or cloud resources. This is basically the backbone of HPC, but it seems like a lot of people "forget" about the 'job scheduler' and 'workflow manager' parts and jump straight to gluing data-analysis code to hardware.

Also important to note: almost all robust workflow managers such as Nextflow already include parts such as "report task completion", "collect task success / failure logs", "report task CPU / memory resource usage", etc., so that you, the end user, only need to write the parts that implement your data analysis.
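To make step 1 concrete, here is a rough sketch of what a single task might look like. It is written in Python rather than R purely for illustration, and everything in it is an assumption: the CSV layout with a `value` column, the names `process_chunk` and `main`, and the choice of a per-chunk mean as the "analysis". The point is only the shape: file in, file out, exit code back, with no knowledge of nodes, schedulers, or other tasks.

```python
import csv
import sys
from pathlib import Path


def process_chunk(in_path, out_path):
    """Process one chunk of data and write a one-row summary file.

    Assumes (hypothetically) that the chunk is a CSV with a numeric
    'value' column. Returns the number of rows processed.
    """
    with Path(in_path).open(newline="") as f:
        values = [float(row["value"]) for row in csv.DictReader(f)]

    # Placeholder "analysis": the mean of the chunk (NaN if the chunk is empty).
    mean = sum(values) / len(values) if values else float("nan")

    with Path(out_path).open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["chunk", "n", "mean"])
        writer.writerow([Path(in_path).name, len(values), mean])
    return len(values)


def main(argv):
    """Command-line entry point: task CHUNK.csv OUT.csv.

    Returns an exit code; a workflow manager or scheduler would use it
    to decide whether the task succeeded.
    """
    if len(argv) != 3:
        print(f"usage: {argv[0]} CHUNK.csv OUT.csv", file=sys.stderr)
        return 2
    process_chunk(argv[1], argv[2])
    return 0


# In a real script you would finish with:
#     if __name__ == "__main__":
#         sys.exit(main(sys.argv))
```

Nothing here does any communication or data passing between nodes. Handing chunks to many instances of this script, retrying failures, and collecting logs and resource usage would be left to the workflow manager and the job scheduler.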