Very interesting! Nice work on your thesis. I am curious: if the data is not resident on the GPU (e.g. multi-TB datasets, line-rate packet inspection, etc.), is this approach bottlenecked by the PCIe bus?
(You may have addressed this in your thesis; feel free to tell me to go RTFD ;)
I haven't tested this, but I would be very surprised if the PCIe bus weren't a severe bottleneck in that case, unless you can somehow amortize the cost of the memcpy.
That said, with datasets that large you'd already be bottlenecked by the necessary communication between GPUs (sadly, even with NVLink), since the queried data always lives on the GPU.
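To put rough numbers on the memcpy cost, here is a back-of-envelope sketch, not a measurement. It models a single pass over the data as a two-stage pipeline (host-to-device copy, then GPU scan). The function name and the bandwidth figures are illustrative assumptions: about 16 GB/s is roughly the theoretical peak of a PCIe 3.0 x16 link, and the 800 GB/s scan rate is a made-up placeholder.

```python
def streamed_scan_seconds(data_bytes, pcie_bytes_per_s, gpu_bytes_per_s, n_chunks=1):
    """Estimate wall time to copy data to the GPU and scan it once.

    Hypothetical model: the data is split into n_chunks equal pieces and
    the copy of chunk i+1 overlaps with the scan of chunk i.
    n_chunks=1 means no overlap (copy everything, then scan everything).
    """
    t_copy = data_bytes / pcie_bytes_per_s / n_chunks  # per-chunk copy time
    t_scan = data_bytes / gpu_bytes_per_s / n_chunks   # per-chunk scan time
    # Two-stage pipeline: fill with the first copy, drain with the last scan,
    # and every step in between costs as much as the slower stage.
    return t_copy + (n_chunks - 1) * max(t_copy, t_scan) + t_scan


ONE_TB = 1e12     # assumed dataset size in bytes
PCIE = 16e9       # assumed PCIe bandwidth in bytes/s
GPU_SCAN = 800e9  # assumed (hypothetical) on-GPU scan rate in bytes/s

serial = streamed_scan_seconds(ONE_TB, PCIE, GPU_SCAN)
overlapped = streamed_scan_seconds(ONE_TB, PCIE, GPU_SCAN, n_chunks=100)
print(f"serial: {serial:.2f} s, overlapped: {overlapped:.2f} s")
```

Under these assumptions, overlapping copies with compute hides the scan but not the transfer: the total stays pinned near the time of the copy alone (62.5 s for 1 TB at 16 GB/s). In this model the copy only amortizes if each transferred byte gets queried many times before it is evicted.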