I wonder why H100 H2D and D2H unpinned (pageable) memcpy bandwidth is *faster* on PCIe with vendor B than on SXM with vendor D. Could resizable BAR be enabled on the PCIe system but not on the SXM one?
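For reference, here is a minimal sketch of how I'd reproduce the measurement in question: pageable vs pinned H2D/D2H bandwidth on one GPU. The 256 MiB buffer size and 20 iterations are arbitrary choices on my part, not what either vendor's setup uses.

```cpp
// Pageable (unpinned) vs pinned H2D/D2H bandwidth microbenchmark (sketch).
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

static float timed_copy(void* dst, const void* src, size_t bytes,
                        cudaMemcpyKind kind, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    // Aggregate bandwidth over all iterations, in GiB/s.
    return (bytes * (float)iters / (1 << 30)) / (ms / 1000.f);
}

int main() {
    const size_t bytes = 256ull << 20;   // 256 MiB per copy (arbitrary)
    const int iters = 20;

    void* dev;
    cudaMalloc(&dev, bytes);
    void* pageable = malloc(bytes);          // plain malloc: unpinned
    void* pinned;
    cudaMallocHost(&pinned, bytes);          // page-locked host memory

    // Touch the host buffers so the pages are actually resident.
    memset(pageable, 1, bytes);
    memset(pinned, 1, bytes);

    printf("H2D pageable: %.1f GiB/s\n", timed_copy(dev, pageable, bytes, cudaMemcpyHostToDevice, iters));
    printf("H2D pinned:   %.1f GiB/s\n", timed_copy(dev, pinned,   bytes, cudaMemcpyHostToDevice, iters));
    printf("D2H pageable: %.1f GiB/s\n", timed_copy(pageable, dev, bytes, cudaMemcpyDeviceToHost, iters));
    printf("D2H pinned:   %.1f GiB/s\n", timed_copy(pinned,   dev, bytes, cudaMemcpyDeviceToHost, iters));

    cudaFree(dev);
    free(pageable);
    cudaFreeHost(pinned);
    return 0;
}
```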
Or, could it be a software configuration difference? The documentation for the driver API flag CU_MEMHOSTREGISTER_IOMEMORY suggests that physical contiguity of host memory can matter to the driver, at least in the context of memory-mapped I/O memory. If vendor B has transparent huge pages (THP) enabled or configured differently than vendor D, small allocations up to 2 MiB could be backed by physically contiguous memory, which might translate into higher efficiency: more bytes transferred per DMA request.
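One way to poke at the THP hypothesis without touching system-wide config: compare H2D bandwidth from an ordinary malloc'd buffer against a 2 MiB-aligned anonymous mapping hinted with MADV_HUGEPAGE. This is a Linux-only sketch of my own; madvise is only a hint (check AnonHugePages in /proc/self/smaps to confirm the backing), and whether the copy engine actually benefits is exactly the open question.

```cpp
// THP hypothesis test (sketch): 4 KiB-paged vs huge-page-hinted pageable source.
#include <cuda_runtime.h>
#include <sys/mman.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    const size_t bytes = 256ull << 20;   // 256 MiB, a multiple of 2 MiB

    void* dev;
    cudaMalloc(&dev, bytes);

    // Baseline: ordinary pageable allocation (4 KiB pages unless THP=always).
    void* plain = malloc(bytes);
    memset(plain, 1, bytes);

    // Candidate: anonymous mapping with an explicit transparent huge page hint.
    void* huge = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    madvise(huge, bytes, MADV_HUGEPAGE);
    memset(huge, 1, bytes);              // fault pages in after the hint

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    void* srcs[2] = {plain, huge};
    const char* labels[2] = {"4 KiB pages", "THP hint"};
    for (int s = 0; s < 2; ++s) {
        cudaEventRecord(start);
        for (int i = 0; i < 20; ++i)
            cudaMemcpy(dev, srcs[s], bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%s: %.1f GiB/s\n", labels[s],
               (bytes * 20.f / (1 << 30)) / (ms / 1000.f));
    }

    munmap(huge, bytes);
    free(plain);
    cudaFree(dev);
    return 0;
}
```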
At a higher level: unpinned memcpy is a performance antipattern. Perhaps vendor D has fewer clients using unpinned memcpy in their workloads than vendor B, or they simply decided not to invest tuning effort in it for that reason. TensorFlow, for example, will go to great lengths to copy unpinned memory into a pinned staging buffer if you feed unpinned host-memory tensors to a graph.
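The general shape of that workaround looks something like the sketch below (not TensorFlow's actual code): copy chunks of the pageable buffer into a page-locked staging buffer yourself, then issue real async DMAs from there, rather than letting the driver stage pageable memory internally and synchronously. The 8 MiB chunk size and single-buffer synchronization are arbitrary simplifications.

```cpp
// Pinned-staging-buffer pattern for H2D copies from pageable memory (sketch).
#include <cuda_runtime.h>
#include <algorithm>
#include <cstring>

void h2d_via_pinned_staging(void* dev_dst, const void* pageable_src,
                            size_t bytes, cudaStream_t stream) {
    const size_t chunk = 8ull << 20;     // 8 MiB staging chunks (arbitrary)
    void* staging;
    cudaMallocHost(&staging, chunk);     // page-locked, directly DMA-able

    for (size_t off = 0; off < bytes; off += chunk) {
        size_t n = std::min(chunk, bytes - off);
        // CPU copy from pageable memory into the pinned staging buffer...
        memcpy(staging, static_cast<const char*>(pageable_src) + off, n);
        // ...then a true asynchronous DMA from pinned memory to the device.
        cudaMemcpyAsync(static_cast<char*>(dev_dst) + off, staging, n,
                        cudaMemcpyHostToDevice, stream);
        // With a single staging buffer we must wait before reusing it;
        // a real implementation double-buffers to overlap memcpy and DMA.
        cudaStreamSynchronize(stream);
    }
    cudaFreeHost(staging);
}
```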
Are both systems using a PCIe switch? If one is and the other isn't, the difference could come down to PCIe credit-based flow control kicking in.