SUMMARY: Clustered systems have become a dominant architecture for scalable high-performance supercomputers. In these large-scale computers, network performance and scalability are as critical as the speed of the compute nodes. InfiniBand™ has become a commodity networking solution that supports the stringent latency, bandwidth and scalability requirements of these clusters. Network performance is also affected by the topology, the packet routing and the communication patterns that the distributed application exercises. Fat-trees are the topology used to construct most large clusters because they are scalable, maintain cross-bisectional bandwidth (CBB), and are practical to build using fixed-arity switches. In this paper, we propose a fat-tree routing algorithm that provides a congestion-free all-to-all shift pattern by leveraging the InfiniBand™ static routing capability. The algorithm supports partially populated fat-trees built with switches of an arbitrary number of ports and arbitrary CBB ratios. To evaluate the proposed algorithm, detailed switch and host simulation models were developed and multiple fabric topologies were simulated. The results of these simulations, as well as measurements on real clusters, show an improvement in all-to-all delay obtained by avoiding congestion on the fabric.
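To make the notion of a congestion-free shift pattern concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a fully populated two-level fat-tree with full CBB, in which each leaf switch has `hosts_per_leaf` hosts and one up-link to each of `hosts_per_leaf` spine switches. In shift stage `s`, host `i` sends to host `(i + s) mod N`. The two routing functions, `dest_mod_k` and `dest_leaf_keyed`, are illustrative assumptions: the first picks the up-link from the destination host index (the well-known destination-mod-k rule), the second picks it from the destination leaf only. All names are hypothetical.

```python
from collections import Counter


def max_link_load(num_leaves, hosts_per_leaf, up_port, shift):
    """Return the largest number of flows sharing one link in one shift stage."""
    n = num_leaves * hosts_per_leaf
    load = Counter()
    for src in range(n):
        dst = (src + shift) % n
        src_leaf = src // hosts_per_leaf
        dst_leaf = dst // hosts_per_leaf
        if src_leaf == dst_leaf:
            continue  # traffic stays inside the leaf switch
        # Static, destination-based choice of spine (assumed routing rule).
        spine = up_port(dst, hosts_per_leaf)
        load[("up", src_leaf, spine)] += 1
        load[("down", spine, dst_leaf)] += 1
    return max(load.values(), default=0)


def dest_mod_k(dst, hosts_per_leaf):
    """Up-link chosen from the destination host index (assumed rule)."""
    return dst % hosts_per_leaf


def dest_leaf_keyed(dst, hosts_per_leaf):
    """Up-link chosen from the destination leaf only (deliberately poor rule)."""
    return (dst // hosts_per_leaf) % hosts_per_leaf


def worst_case(num_leaves, hosts_per_leaf, up_port):
    """Largest link load over all non-trivial shift stages."""
    n = num_leaves * hosts_per_leaf
    return max(
        max_link_load(num_leaves, hosts_per_leaf, up_port, s)
        for s in range(1, n)
    )


if __name__ == "__main__":
    print("dest_mod_k worst load:     ", worst_case(4, 4, dest_mod_k))
    print("dest_leaf_keyed worst load:", worst_case(4, 4, dest_leaf_keyed))
```

A worst-case load of 1 means that no link carries more than one flow in any stage, which is what "congestion-free" means in this toy model. The sketch does not cover partially populated trees, more than two levels, or CBB ratios other than 1, all of which the abstract says the proposed algorithm supports.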
The advent of RoCE (RDMA over Converged Ethernet) has led to a significant increase in the use of RDMA in datacenter networks. To achieve good performance, RoCE requires a lossless network, which is in turn achieved by enabling Priority Flow Control (PFC) within the network. However, PFC brings with it a host of problems such as head-of-the-line blocking, congestion spreading, and occasional deadlocks. Rather than seek to fix these issues, we instead ask: is PFC fundamentally required to support RDMA over Ethernet? We show that the need for PFC is an artifact of current RoCE NIC designs rather than a fundamental requirement. We propose an improved RoCE NIC (IRN) design that makes a few simple changes to the RoCE NIC for better handling of packet losses. We show that IRN (without PFC) outperforms RoCE (with PFC) by 6-83% for typical network scenarios. Thus not only does IRN eliminate the need for PFC, it improves performance in the process! We further show that the changes that IRN introduces can be implemented with modest overheads of about 3-10% to NIC resources. Based on our results, we argue that research and industry should rethink the current trajectory of network support for RDMA.