logoalt Hacker News

hpcdudelast Friday at 3:02 PM1 replyview on HN

HPCdude here, and this is a mostly correct article, but here are what it misses:

1) It mentions in passing the hardware abstraction not being as universal as it seemed. This is more and more true, once we started doing fpgas, then asics, and as ARM and other platforms starting making headway, it fractured things a bit.

GPUs too: I'm still a bit upset about CUDA winning over OpenCL, but Vulkan compute gives me hope. I haven't messed with SYCL but it might be a future possibility too.

2) The real crux is the practical, production interfaces that HPC end users get. Normally I'm not exposing an entire cluster (or sub cluster) to a researcher. I give them predefined tools which handle the computation across nodes for them (SLURM is old but still huge in the HPC space for a reason! When I searched the article for "slurm" I got 0 hits!) When it comes to science, reproducibility is the name of the game, and having more idempotent and repeatable code structures are what help us gain real insights that others can verify or take and run with into new problem/solutions. Ad-hoc HPC programming doesn't do that well, like existing languages slurm and other orchestration layers handle.

Sidenote: One of the biggest advances recently is in RDMA improvements (remote direct memory access), because the RAM needs of these datasets are growing to crazy numbers, and often you have nodes being underutilized that are happy to help. I've only done RoCE myself though and not much with Infiniband, (sorry Yale, thats why I flubbed on the interview) but honestly, I still really like RoCE for cluster side and LACP for front facing ingress/egress.

The point is existing tooling can be massaged and often we don't need new languages. I did some work with Mellanox/Weka prior to them being bought by Nvidia on optimizing the kernel shims for NFSv4 for example. Old tech made fast again.


Replies

abloblast Friday at 10:38 PM

afaik, you always have to break open abstractions for more performance. If you ignore cache-levels in your program you're gonna have a bad time - and depending on the system the layout (and with it how you should use it) is different. The same is true for how machines are interconnected. Depending on the wiring you have different throughput-values when sharing data between nodes. The whole area screams "not universal" to me.

show 1 reply