logoalt Hacker News

hayley-pattontoday at 12:08 PM1 replyview on HN

As not mentioned in the article, if you want the general form of this algorithm, it is a Hillis-Steele prefix sum: <https://en.wikipedia.org/wiki/Prefix_sum#Algorithm_1:_Shorte...>


Replies

mlochbaumtoday at 4:07 PM

I don't think this really describes neon_prefixsum_fast as a whole? The algorithm does use a Hillis-Steele sum on sums of 4 values, but each of these is computed with a sequential sum, interleaving those with a transposed order. In terms of what's added to what, it's actually quite a bit like my "Sequential broadcasting" picture from [0]. The reference I'd use for a general form is "Parallel Scan as a Multidimensional Array Problem"[1], breaking 16 elements into a 4x4 array; the paper describes how the scan splits into a row-wise scan, plus values obtained from an exclusive scan on carries from the rows.

[0] https://mlochbaum.github.io/BQN/implementation/primitive/fol...

[1] https://ashinkarov.github.io/pubs/2022-scan.html