Analyzing the impact of supporting out-of-order communication on in-order performance with iWARP

Balaji, Pavan; Feng, Wu-chun; Bhagvat, S.; Panda, Dhabaleswar K.; Thakur, Rajeev; Gropp, William

doi:10.1145/1362622.1362670

Cited by 8 publications

(2 citation statements)

References 18 publications

“…They evaluate reordering of packets within one message and not message reordering. Handling out of order packets in MPI has also been studied by Balaji et al [21] for the Internet Wide-Area RDMA Protocol (iWARP) over 10-Gigabit Ethernet. Sur et al [13] discuss how to provide message ordering within the MPI two-sided implementation on InfiniBand using sequence numbers.…”

Section: B Architectures and Relaxed Ordering (mentioning)

confidence: 99%

An Evaluation of One-Sided and Two-Sided Communication Paradigms on Relaxed-Ordering Interconnect

Ibrahim, Hargrove, Iancu, et al. 2014

2014 IEEE 28th International Parallel and Distributed Processing Symposium

Abstract: The Cray Gemini interconnect hardware provides multiple transfer mechanisms and out-of-order message delivery to improve communication throughput. In this paper we quantify the performance of one-sided and two-sided communication paradigms with respect to: 1) the optimal available hardware transfer mechanism; 2) message ordering constraints; 3) per node and per core message concurrency. In addition to using Cray native communication APIs, we use UPC and MPI micro-benchmarks to capture one- and two-sided semantics respectively. Our results indicate that relaxing the message delivery order can improve performance up to 4.6× when compared with strict ordering. When hardware allows it, high-level one-sided programming models can already take advantage of message reordering. Enforcing the ordering semantics of two-sided communication comes with a performance penalty. Furthermore, it seems that exposing out-of-order delivery at the application level is required for the next generation of programming models. Any ordering constraints in the language specifications reduce communication performance for small messages and increase the number of active cores required for peak throughput.

I. INTRODUCTION

Hardware vendors traditionally employed a combination of Remote Direct Memory Access (RDMA) and out-of-order packet delivery to provide communication throughput on large scale multicore systems. At the system level API (Application Programming Interface), messages were usually ordered to meet the semantic requirements of higher level abstractions. Recently, the Cray Gemini hardware and APIs started exposing multiple message ordering modes to its clients. The main contribution of this work is the evaluation of how well equipped to take advantage of hardware out-of-order message delivery are existing one-sided and two-sided programming models or communication libraries. To our knowledge, ours is the first study to examine in detail the usage of this functionality in implementations of programming model abstractions.

The predominant paradigm for the last twenty years has been the Message Passing Interface (MPI) with its two-sided synchronization semantics. The non-blocking Isend/IRecv communication primitives in MPI can internally take advantage of un-ordered messaging.

One-sided communication has started to gain popularity roughly ten years ago, as showcased by the Unified Parallel C [1] programming language. Until recently, the UPC language standard provided only blocking communication and a relaxed order memory model to allow compiler or runtime optimizations. In Nov 2012, both the UPC 1.3 (draft) [2] and the MPI 3.0 specifications [3] introduced user level nonblocking one-sided communication primitives and brought to
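The ordering constraint that the abstract refers to can be illustrated with a minimal sketch of a receiver that restores in-order delivery from out-of-order arrivals using per-message sequence numbers. This is an illustration only and is not taken from any of the papers listed here; the class `InOrderReceiver` and its method names are hypothetical, and the sketch assumes each sequence number arrives exactly once.

```python
class InOrderReceiver:
    """Buffers early arrivals and releases payloads in sequence-number order.

    Hypothetical illustration; assumes sequence numbers start at 0 and
    that each number arrives exactly once.
    """

    def __init__(self):
        self.next_seq = 0   # sequence number that must be delivered next
        self.pending = {}   # early arrivals, keyed by sequence number

    def receive(self, seq, payload):
        """Accept one arrival and return the payloads now deliverable in order."""
        self.pending[seq] = payload
        delivered = []
        # Drain every buffered message that continues the in-order run.
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered


if __name__ == "__main__":
    receiver = InOrderReceiver()
    for seq, payload in [(2, "c"), (0, "a"), (1, "b")]:
        print(seq, receiver.receive(seq, payload))
```

In this sketch a message that arrives ahead of its predecessors is held until the gap is filled, which is the kind of waiting that strict ordering imposes and that relaxed ordering avoids.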

“…High-speed networks have been evaluated for both 10 Gigabit Ethernet network interface cards (NICs) with offload engines [1,8,10] and others [38]. Message Passing Interface (MPI) implementations have support for multiple networks including InfiniBand, iWARP, and solutions such as RoCE [33].…”

Section: Introduction (mentioning)

confidence: 99%

Scalable connectionless RDMA over unreliable datagrams

et al. 2015

Abstract: The overhead imposed by connection-based protocols for high-performance computing (HPC) systems can be detrimental to system resource usage and performance. This paper demonstrates for the first time a unified send/recv and Remote Direct Memory Access (RDMA) Write over datagrams design for RDMA-capable network adapters. We previously designed the first and only unreliable datagram RDMA model, RDMA Write-Record, and demonstrated its superior performance over connection-based RDMA. RDMA Write-Record can be applied to several RDMA capable networks, such as iWARP and InfiniBand (which does not support unreliable RDMA Writes). iWARP is a state-of-the-art, high-speed, connection-based RDMA networking technology for both local and wide-area Ethernet networks. iWARP is used as the platform to demonstrate our unreliable RDMA operation design for both channel and memory semantics. We previously outlined the requirements for extending iWARP to operate over datagrams. Here we extend our work on commercial datacenter applications by providing broadcast support for send/recv. In order to study the scalability of datagram-iWARP, we added Message Passing Interface support for RDMA Write-Record to investigate the scalability of HPC-based scientific applications for both send/recv and RDMA Write-Record. The results show that both models outperform their connection-based alternatives, providing superior performance and scalability in a software prototype.
