Awesome stuff :) The results are really impressive.
GPU acceleration is a really good idea; my CPU implementation of a similar idea with triangles (https://github.com/anematode/triangle-stacking) was compute constrained for smaller images (which was fixable with good SIMD optimizations) but became bandwidth constrained for very large images that don't fit in L2. I think a port to OpenCL would have been a good idea.