A beautiful algorithm.
Would there be any value in using SIMD during the narrowing phase to check the whole cache line you fetch for an exact match, as an early out?
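To make the question concrete, here is a rough sketch of the idea, not the article's algorithm. It assumes a plain binary search over sorted `int32_t` values, a 64-byte cache line (16 values), and an array base that is 64-byte aligned so that index blocks line up with cache lines. The names `scan_block` and `search_with_line_check` are made up for illustration. It uses SSE2 where available and falls back to a scalar loop otherwise.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

enum { LINE_INTS = 16 }; /* assumed: 64-byte line / 4-byte int32_t */

/* Return the index in [start, end) of an element equal to key, or -1. */
static ptrdiff_t scan_block(const int32_t *a, size_t start, size_t end,
                            int32_t key)
{
    size_t i = start;
#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi32(key);
    for (; i + 4 <= end; i += 4) {
        /* Unaligned load, so this is safe even if the base is not aligned. */
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi32(v, needle));
        if (mask != 0) {
            /* Each matching 32-bit lane sets 4 consecutive mask bits. */
            for (int lane = 0; lane < 4; lane++)
                if (mask & (0xF << (4 * lane)))
                    return (ptrdiff_t)(i + (size_t)lane);
        }
    }
#endif
    /* Scalar tail, and the whole scan when SSE2 is unavailable. */
    for (; i < end; i++)
        if (a[i] == key)
            return (ptrdiff_t)i;
    return -1;
}

/* Binary search that checks the whole 16-element block around each probe. */
ptrdiff_t search_with_line_check(const int32_t *a, size_t n, int32_t key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        size_t start = mid & ~(size_t)(LINE_INTS - 1);
        size_t end = start + LINE_INTS;
        if (start < lo) start = lo; /* stay inside the live range */
        if (end > hi) end = hi;

        ptrdiff_t hit = scan_block(a, start, end, key);
        if (hit >= 0)
            return hit; /* early out: exact match somewhere in the line */

        /* No match in the block, so skip the whole block at once. */
        if (key < a[start])
            hi = start;
        else if (key > a[end - 1])
            lo = end;
        else
            return -1; /* key would have to be inside this block */
    }
    return -1;
}
```

With duplicates this returns some matching index, not necessarily the first. Whether the extra compares pay for themselves is exactly what I am unsure about, so it would need measuring.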