“…The NIC that attaches the node to the parallel network is itself connected to a programmable "local network" that links it to both the CPU memory and the GPU memory. This combination of interconnects means that parallel communication latency and bandwidth (see the first report [8]) are limited by the NIC, the local network, the NVLinks from the CPU to the local network, and the GPU memory, but not by the CPU memory. However, CUDA-aware MPI calls (sends, receives, and waits) must still be issued by code running on the CPU cores.…”
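To make the last point concrete, the sketch below shows a minimal CUDA-aware MPI exchange over GPU-resident buffers. It assumes an MPI library built with CUDA support so device pointers can be passed directly to MPI; the message size, ring neighbors, and tag are illustrative and not taken from the report. The data path can bypass CPU memory, but every MPI call is still host code executed on the CPU cores.

```c
/* Sketch of a CUDA-aware MPI ring exchange.
 * Assumes an MPI build with CUDA support; buffer size, neighbors,
 * and tag are illustrative. All MPI calls run on CPU cores, even
 * though the buffers live in GPU memory. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int N = 1 << 20;              /* illustrative message size (doubles) */
    double *d_send, *d_recv;            /* buffers allocated in GPU memory */
    cudaMalloc((void **)&d_send, N * sizeof(double));
    cudaMalloc((void **)&d_recv, N * sizeof(double));
    cudaMemset(d_send, 0, N * sizeof(double));

    int right = (rank + 1) % nranks;            /* send to right neighbor */
    int left  = (rank + nranks - 1) % nranks;   /* receive from left neighbor */
    MPI_Request reqs[2];

    /* CUDA-aware MPI: device pointers are handed straight to the
     * send/receive calls, issued by the CPU. */
    MPI_Irecv(d_recv, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(d_send, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```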