Side note, but this product looks really cool! I have a fundamental mistrust of all boolean operations, so to see a system that actually works with degenerate cases correctly is refreshing.
I looked into this because part of our pipeline is forced to be chunked. Most advice I've seen boils down to "more contiguity = better", but without numbers, or at least none that generalize.
My concrete tasks already reach peak performance at chunk sizes below 128 kB, and I couldn't find any pure processing workloads that benefit significantly beyond a 1 MB chunk size. The code is linked in the post; it would be nice to see results on more systems.
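A minimal sketch of the kind of sweep I ran, assuming a trivial per-chunk workload (summing bytes); the sizes and function name here are illustrative, not the actual linked code:

```python
import time

def sweep_chunk_sizes(data: bytes, sizes_kb=(16, 64, 128, 512, 1024)):
    """Time a trivial per-chunk workload (summing bytes) at several chunk sizes."""
    results = {}
    for kb in sizes_kb:
        chunk = kb * 1024
        start = time.perf_counter()
        total = 0
        for off in range(0, len(data), chunk):
            total += sum(data[off:off + chunk])
        results[kb] = time.perf_counter() - start
    return results

# Example: sweep over 8 MB of repeating bytes
timings = sweep_chunk_sizes(bytes(range(256)) * (8 * 1024 * 1024 // 256))
```

The interesting part is whether the timings keep improving past 128 kB; in my runs they flatten out well before that.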
Would kernel huge pages possibly have an effect here as well?
Is this an attempt at nerd sniping? ;-)
On GPU databases we sometimes go up to the GB range per "item of work" (input permitting), as it's very efficient.
I need to add having a look at your GitHub code to my TODO list...
This is good data, but I'm not sure what the actionable takeaway is for me as a Grug Programmer.
Does it mean that if I'm doing very light processing (e.g. sums), I should move to a structure-of-arrays layout to take advantage of the cache? But if I'm doing something very expensive, I can leave it as array-of-structures, since the computation will dominate the memory access in an Amdahl's-law analysis? This data should tell me something about how to organize and access my data, right?
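The distinction I mean, as a pure-Python stand-in (in practice you'd use numpy or C; the field names here are made up):

```python
from array import array

N = 1000

# Array-of-structures: each record is an interleaved (x, y, z) tuple.
aos = [(float(i), i * 2.0, i * 3.0) for i in range(N)]

# Structure-of-arrays: one contiguous array per field.
soa_x = array('d', (float(i) for i in range(N)))
soa_y = array('d', (i * 2.0 for i in range(N)))
soa_z = array('d', (i * 3.0 for i in range(N)))

# Summing one field in AoS layout walks past the other fields of every record...
aos_sum_x = sum(rec[0] for rec in aos)
# ...while in SoA layout it streams one contiguous array, which is cache-friendly.
soa_sum_x = sum(soa_x)

assert aos_sum_x == soa_sum_x  # same answer, very different memory traffic
```

For an expensive per-record computation, the layout matters much less, since the arithmetic dominates the memory access.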
I've casually experimented with this in Python a number of times for various hot loops, including ones where I'm passing the chunk between C routines. On an Apple M1 I've never seen a case where chunks larger than 16 kB mattered. That's the page size, so it's totally unsurprising.
Nevertheless, it's been a helpful rule of thumb for not overthinking optimizations.
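The shape of those experiments, sketched with a stand-in for the real C routine (the buffer size and function name are illustrative):

```python
PAGE = 16 * 1024  # M1 page size; my rule-of-thumb chunk size

def process_chunked(buf: bytes, chunk: int = PAGE) -> int:
    """Feed fixed-size slices of a buffer to a per-chunk routine."""
    total = 0
    view = memoryview(buf)  # slicing a memoryview avoids copying each chunk
    for off in range(0, len(buf), chunk):
        piece = view[off:off + chunk]
        total += len(piece)  # stand-in for the C routine the chunk is passed to
    return total

data = b'\x00' * (1 << 20)  # 1 MB test buffer
assert process_chunked(data) == len(data)
```

In my tests, raising `chunk` above 16 kB never moved the needle, which is why I stopped tuning it.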