PCIe already allows DMA between peers on the bus, but, as you pointed out, the traces for the lanes have to terminate somewhere. That doesn't have to be the CPU (which, of course, hosts the PCIe root complex in modern systems), though: a PCIe switch can handle DMA between the devices attached to it, provided it supports routing that traffic directly.
That’s what happened in TFA.
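
To make the routing point concrete, here's a toy sketch (not real PCIe enumeration, just a made-up tree) that finds where traffic between two devices would turn around. The device names and the `UPSTREAM` table are hypothetical, and it assumes every switch in the path supports routing peer traffic directly.

```python
# Toy topology: each node maps to its upstream port's owner.
# All names are hypothetical; the root complex has no upstream node.
UPSTREAM = {
    "root_complex": None,
    "switch0": "root_complex",
    "gpu0": "switch0",
    "nvme0": "switch0",
    "nic0": "root_complex",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root complex."""
    path = []
    while node is not None:
        path.append(node)
        node = UPSTREAM[node]
    return path


def turnaround_point(a, b):
    """Return the lowest node that sits upstream of both devices."""
    upstream_of_a = set(path_to_root(a)[1:])
    for node in path_to_root(b)[1:]:
        if node in upstream_of_a:
            return node
    raise ValueError("devices do not share a hierarchy")


if __name__ == "__main__":
    # Both devices hang off the same switch, so traffic can stay below it.
    print(turnaround_point("gpu0", "nvme0"))
    # These two only meet at the root complex, i.e. at the CPU.
    print(turnaround_point("gpu0", "nic0"))
```

In this model, two devices behind the same switch never need to touch the root complex, while devices on separate root ports do.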