I don't think this really describes neon_prefixsum_fast as a whole? The algorithm does use a Hillis-Steele sum on sums of 4 values, but each of these is computed with a sequential sum, interleaving those with a transposed order. In terms of what's added to what, it's actually quite a bit like my "Sequential broadcasting" picture from [0]. The reference I'd use for a general form is "Parallel Scan as a Multidimensional Array Problem"[1], breaking 16 elements into a 4x4 array; the paper describes how the scan splits into a row-wise scan, plus values obtained from an exclusive scan on carries from the rows.
[0] https://mlochbaum.github.io/BQN/implementation/primitive/fol...