Hacker News

lmeyerov · 10/11/2024

Nice!

It's interesting from the maintenance perspective too. You can bet most constants like warp sizes will change, so you end up reaching for profiles, autotuners, or just not sweating the small stuff.

We went more extreme, and nowadays focus several layers up: by accepting the (high!) constant overheads of tools like RAPIDS cuDF, we get in exchange code that saturates the newest GPUs well and that any data scientist can edit and extend. Likewise, they only need to understand basics like data movement and columnar analytics data representations to build GPU pipelines. We have ~1 CUDA kernel left and many years of higher-level code.
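To make that trade-off concrete, here is a minimal sketch of the style, assuming cuDF is installed; the file and column names are made up for illustration, not taken from our code:

    # Sketch of the "several layers up" style: no hand-written CUDA, just
    # columnar, data-parallel ops that cuDF dispatches to GPU kernels.
    # File and column names (edges.parquet, src, dst, weight) are made up.
    import cudf

    edges = cudf.read_parquet("edges.parquet")  # columnar load straight into GPU memory

    # One data-parallel pass: filter, then a grouped aggregation, all on-GPU.
    heavy = edges[edges["weight"] > 0.5]
    out_degree = heavy.groupby("src").agg({"dst": "count", "weight": "sum"})
    print(out_degree.head())

Anyone who knows pandas can read and change that; the per-call constant overhead is the price for not owning kernels.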

As an example, this is one of the core methods of our new graph query language GFQL (think Cypher on pandas/spark, with an optional GPU runtime), and it gets Graph500-level performance on cheap GPUs just by being data parallel with high saturation per step: https://github.com/graphistry/pygraphistry/blob/master/graph... . Despite ping-ponging a ton because cuDF doesn't (yet) coalesce GPU kernel calls, V1 competes surprisingly well, and is easy to maintain & extend.
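For a rough idea of the pattern (a hedged sketch, not the actual GFQL source behind the link above): a single traversal hop can be written as a data-parallel join, so each step saturates the GPU regardless of traversal depth. Table and column names here are assumptions:

    # One traversal "hop" as a data-parallel join. Per-step work is a handful
    # of GPU-wide columnar ops, so saturation comes from frontier size rather
    # than loop trickery. Column names (node, src, dst) are assumptions.
    import cudf

    def hop(frontier: cudf.DataFrame, edges: cudf.DataFrame) -> cudf.DataFrame:
        """Expand a node frontier by one edge hop, entirely on-GPU."""
        expanded = frontier.merge(edges, left_on="node", right_on="src")
        # New frontier = destination nodes, deduplicated on-GPU.
        return expanded[["dst"]].rename(columns={"dst": "node"}).drop_duplicates()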


Replies

trentnelson · 10/11/2024

Had any exposure to r=2 hypergraph implementations on the GPU? Ideally with an efficient way to determine if the graph is acyclic?

(The usual CPU algorithms for this work great there but are woeful on GPUs.)
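For illustration only (not an answer from the thread): one data-parallel way to test an ordinary directed graph (an r=2 hypergraph) for acyclicity is to peel zero-in-degree sources Kahn-style with columnar ops, so each round is a few GPU-wide passes instead of a pointer-chasing DFS. The src/dst column names are assumptions:

    # Illustration only: data-parallel acyclicity test for a directed graph.
    import cudf

    def is_acyclic(edges: cudf.DataFrame) -> bool:
        edges = edges[["src", "dst"]]
        while len(edges) > 0:
            # Any node that appears as a destination still has in-degree > 0.
            has_incoming = edges["dst"].unique()
            # Edges whose source has in-degree 0 can be peeled this round.
            peelable = edges[~edges["src"].isin(has_incoming)]
            if len(peelable) == 0:
                return False  # no peelable sources left, so a cycle must exist
            edges = edges[~edges["src"].isin(peelable["src"])]
        return True

Each round is all-edges-parallel; the sequential part is only as deep as the longest path, which is the same step-parallel shape discussed above.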
