To be fair I was suprised too. But I made a relatively simple straight port from the AMD rays sdk plus some input from the pbrt-v4 CPU bvh code and it just worked relatively well out of the box... This is the main intersection function which is quite simple: https://github.com/JuliaGeometry/Raycore.jl/blob/sd/multityp... I'm not even using local memory, since it was already fast enough ;) But I think we can still do quite a lot, large parts of the construction code are still very messy, and I want to polish and modularize the code over time.
makes sense honestly - straight port from a solid SDK beats reimplementing everything from scratch. local memory optimization is one of those rabbit holes anyway. construction code being messy is just that stage of the project