It's funny how often we assume we need a graphics card for these kinds of calculations when a standard CPU is plenty fast. The memory-layout changes seem to have made the biggest difference here, by letting the hardware actually use its vector (SIMD) units.
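For anyone curious what that kind of layout change tends to look like, here's a rough sketch of the classic array-of-structs to struct-of-arrays switch (not the code from the article, and the names `ParticleAoS`, `ParticlesSoA`, and `integrate_x` are made up for illustration):

```c
// Array-of-structs: x, y, z for each particle are interleaved in memory,
// so a loop over just the x values strides through memory and tends to
// defeat auto-vectorization.
struct ParticleAoS { float x, y, z; };

// Struct-of-arrays: each field is contiguous, so the compiler can load
// several consecutive x values into one vector register per iteration.
struct ParticlesSoA {
    float *x;
    float *y;
    float *z;
};

// With the SoA layout, a simple loop like this will usually compile to
// packed SIMD adds (e.g. SSE/AVX on x86) with no intrinsics needed.
void integrate_x(struct ParticlesSoA p, float vx_dt, int n) {
    for (int i = 0; i < n; i++) {
        p.x[i] += vx_dt;
    }
}
```

The point being that the arithmetic doesn't change at all; only the data arrangement does, and that's what unlocks the wide registers.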
These days a single machine with plenty of RAM and cores will handle almost everything you throw at it, barring specific compute-intensive or memory-bound workloads (current AI, gaming, etc.).
At the risk of being called out for my ignorance (I'm still new to GPU development and have only limited experience with CUDA), it seems to come down to how well the execution model fits the work, e.g. SIMT vs. SIMD here.
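A rough illustration of the distinction, hedged since I'm also still learning this: with SIMD, one instruction stream on the CPU operates on a short vector of data at once, whereas with SIMT you write scalar-looking code per thread and the GPU hardware runs groups of threads (warps of 32 on NVIDIA) in lockstep. A standard SAXPY in both styles:

```cuda
#include <cuda_runtime.h>

// SIMT: each thread computes one element. The hardware executes a warp's
// threads together, and divergent branches (like the bounds check) are
// handled by per-warp masking rather than by the programmer.
__global__ void saxpy_simt(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// SIMD (host side): a single instruction stream over contiguous data.
// Written as a plain loop here; compilers will typically emit packed
// multiply-adds for it when the arrays don't alias.
void saxpy_simd(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Launching the kernel over `n` elements would look something like `saxpy_simt<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);`. The per-element body is identical; what differs is whether the parallelism lives in wide registers (SIMD) or in thousands of lightweight threads (SIMT), which is why the fit between the workload and the execution model matters so much.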