Evaluating the Potential of Cray Gemini Interconnect for PGAS Communication Runtime Systems

Vishnu, Abhinav; Bruggencate, Monika ten; Olson, Ryan M.

doi:10.1109/hoti.2011.19

Cited by 19 publications

(8 citation statements)

References 21 publications

(39 reference statements)

“…Therefore, BigInt does not increase communication time significantly because the additional bits lead to a minor increase since packet headers and fixed (zero-load) delays in the network and NICs dominate for small transfers. Our results are confirmed by past work [21] which showed that the communication delay for messages with payloads containing double-word, quad-word, and double quad-word variables is identical and approximately 12 µs for up to 512 bytes, which is a larger payload size than BigInts (256 bytes).…”

Section: Performance Evaluation (supporting)

confidence: 79%

“…This was previously possible only with double-precision variables due to the complex internal structures of arbitrary-precision libraries. Even though the size of a BigInt variable is 2101 bits, 33× larger than a 64-bit double-precision variable, we observe insignificant loss in communication delay due to the fixed latency costs and packet overhead bytes in modern large-scale networks [21]. Therefore, BigInts readily apply to network operations and can be combined with past work on local-node computations that uses sorting and recursion or alternative wide fixed-point representations with dedicated hardware support [22], [23], [15], in order to provide reproducible system-wide operations with no precision loss.…”

Section: Introduction (mentioning)

confidence: 99%

Extending Summation Precision for Network Reduction Operations

Bailey

Shalf

2013

2013 25th International Symposium on Computer Architecture and High Performance Computing

Abstract: Double precision summation is at the core of numerous important algorithms such as Newton-Krylov methods and other operations involving inner products, but the effectiveness of summation is limited by the accumulation of rounding errors, which are an increasing problem with the scaling of modern HPC systems and data sets. To reduce the impact of precision loss, researchers have proposed increased- and arbitrary-precision libraries that provide reproducible error or even bounded error accumulation for large sums, but do not guarantee an exact result. Such libraries can also increase computation time significantly. We propose big integer (BigInt) expansions of double precision variables that enable arbitrarily large summations without error and provide exact and reproducible results. This is feasible with performance comparable to that of double-precision floating point summation, by the inclusion of simple and inexpensive logic into modern NICs to accelerate performance on large-scale systems.
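The idea in this abstract, expanding each double into a wide integer so that sums are exact and independent of summation order, can be illustrated in software. The following is a minimal sketch under stated assumptions, not the authors' BigInt format or their NIC logic: it relies only on the fact that every finite IEEE 754 double is an integer multiple of 2^-1074, and the names `SCALE_BITS`, `to_fixed`, `exact_sum`, and `exact_sum_as_float` are hypothetical.

```python
from fractions import Fraction
import math

# Assumption for this sketch: scale by 2**1074, because the smallest
# positive (subnormal) double is 2**-1074, so every finite double becomes
# an exact integer after scaling.
SCALE_BITS = 1074


def to_fixed(x: float) -> int:
    """Expand a finite double into an exact integer multiple of 2**-1074."""
    if not math.isfinite(x):
        raise ValueError("only finite values can be expanded")
    numerator, denominator = x.as_integer_ratio()  # denominator is a power of two
    return numerator * ((1 << SCALE_BITS) // denominator)


def exact_sum(values) -> Fraction:
    """Sum doubles with integer arithmetic only, so no rounding occurs."""
    total = sum(to_fixed(v) for v in values)
    return Fraction(total, 1 << SCALE_BITS)


def exact_sum_as_float(values) -> float:
    """Round the exact result to the nearest double once, at the very end."""
    return float(exact_sum(values))


if __name__ == "__main__":
    data = [1e16, 1.0, -1e16]
    print(sum(data))                 # floating-point sum loses the 1.0
    print(exact_sum_as_float(data))  # integer expansion keeps it
```

With this scaling, the largest finite double expands to a 2098-bit integer, which is the same order of magnitude as the 2101-bit BigInt size quoted in the citation statement above; the exact width and layout used in the paper are not reproduced here.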

“…Compiler-based approaches, such as OMPI [16], similarly Friedley and Lumsdaine describe a compiler approach, producing a 40% improvement, by exploiting one-sided communication via transformation of MPI calls [4]. Similar work exploits one-sided communication within the Partitioned Global Address Space (PGAS) languages like Chapel [2], UPC [20], Global Arrays [14], and Co-Array FORTRAN [15] with some preliminary Gemini work using the DMAPP API [21].…”

Section: Performance Results (mentioning)

confidence: 99%

Optimizing fine-grained communication in a biomolecular simulation application on Cray XK6

Sun

Zheng

Mei

et al. 2012

2012 International Conference for High Performance Computing, Networking, Storage and Analysis

Abstract: Achieving good scaling for fine-grained communication intensive applications on modern supercomputers remains challenging. In our previous work, we have shown that such an application, NAMD, scales well on the full Jaguar XT5 without long-range interactions; yet, with them, the speedup falters beyond 64K cores. Although the new Gemini interconnect on Cray XK6 has improved network performance, the challenges remain, and are likely to remain for other such networks as well. We analyze communication bottlenecks in NAMD and its CHARM++ runtime, using the Projections performance analysis tool. Based on the analysis, we optimize the runtime, built on the uGNI library for Gemini. We present several techniques to improve the fine-grained communication. Consequently, the performance of running 92224-atom Apoa1 with GPUs on TitanDev is improved by 36%. For 100-million-atom STMV, we improve upon the prior Jaguar XT5 result of 26 ms/step to 13 ms/step using 298,992 cores on Jaguar XK6.

“…Vishnu et al. [23], [24] present the implementation of the Aggregate Remote Memory Copy Interface (ARMCI) on Cray XE6 using DMAPP with relaxed ordering. Shan et al. [25] present a performance evaluation of UPC and MPI benchmarks on Gemini and show applications using single-sided communication outperform those using two-sided paradigm.…”

Section: B Architectures and Relaxed Ordering (mentioning)

confidence: 99%

An Evaluation of One-Sided and Two-Sided Communication Paradigms on Relaxed-Ordering Interconnect

Ibrahim

Hargrove

Iancu

et al. 2014

2014 IEEE 28th International Parallel and Distributed Processing Symposium

Abstract: The Cray Gemini interconnect hardware provides multiple transfer mechanisms and out-of-order message delivery to improve communication throughput. In this paper we quantify the performance of one-sided and two-sided communication paradigms with respect to: 1) the optimal available hardware transfer mechanism; 2) message ordering constraints; 3) per node and per core message concurrency. In addition to using Cray native communication APIs, we use UPC and MPI micro-benchmarks to capture one- and two-sided semantics respectively. Our results indicate that relaxing the message delivery order can improve performance up to 4.6× when compared with strict ordering. When hardware allows it, high-level one-sided programming models can already take advantage of message reordering. Enforcing the ordering semantics of two-sided communication comes with a performance penalty. Furthermore, it seems that exposing out-of-order delivery at the application level is required for the next generation of programming models. Any ordering constraints in the language specifications reduce communication performance for small messages and increase the number of active cores required for peak throughput.

I. INTRODUCTION

Hardware vendors traditionally employed a combination of Remote Direct Memory Access (RDMA) and out-of-order packet delivery to provide communication throughput on large scale multicore systems. At the system level API (Application Programming Interface), messages were usually ordered to meet the semantic requirements of higher level abstractions. Recently, the Cray Gemini hardware and APIs started exposing multiple message ordering modes to its clients. The main contribution of this work is the evaluation of how well existing one-sided and two-sided programming models or communication libraries are equipped to take advantage of hardware out-of-order message delivery. To our knowledge, ours is the first study to examine in detail the usage of this functionality in implementations of programming model abstractions.

The predominant paradigm for the last twenty years has been the Message Passing Interface (MPI) with its two-sided synchronization semantics. The non-blocking Isend/IRecv communication primitives in MPI can internally take advantage of un-ordered messaging.

One-sided communication started to gain popularity roughly ten years ago, as showcased by the Unified Parallel C [1] programming language. Until recently, the UPC language standard provided only blocking communication and a relaxed order memory model to allow compiler or runtime optimizations. In Nov 2012, both the UPC 1.3 (draft) [2] and the MPI 3.0 specifications [3] introduced user level nonblocking one-sided communication primitives and brought to
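The difference between strict and relaxed delivery order described in this abstract can be pictured with a toy receiver model. This is only an illustrative sketch, not Gemini hardware behavior and not the uGNI, DMAPP, UPC, or MPI APIs: the function name `deliver` and the sequence-number scheme are hypothetical. Under strict ordering the receiver must hold back any message whose predecessors have not yet arrived, while under relaxed ordering each message is handed over as soon as it arrives.

```python
def deliver(arrivals, strict):
    """Toy model of a receiver handling messages tagged 0..n-1.

    arrivals: sequence numbers in the order the network delivers them.
    strict:   if True, message k is released only after messages 0..k-1.
    Returns (release_order, max_buffered), where max_buffered is the largest
    number of messages held back while waiting for a predecessor.
    """
    released = []
    pending = set()
    next_seq = 0
    max_buffered = 0
    for seq in arrivals:
        if not strict:
            # Relaxed ordering: hand the message over immediately.
            released.append(seq)
            continue
        pending.add(seq)
        # Strict ordering: release the longest in-order run now available.
        while next_seq in pending:
            pending.remove(next_seq)
            released.append(next_seq)
            next_seq += 1
        max_buffered = max(max_buffered, len(pending))
    return released, max_buffered


if __name__ == "__main__":
    network_order = [2, 0, 3, 1]  # hypothetical out-of-order arrival
    print(deliver(network_order, strict=True))
    print(deliver(network_order, strict=False))
```

In this model the strict receiver stalls messages 2 and 3 until message 1 arrives, which is one intuition for why ordering constraints can cost throughput; the measured effects in the paper depend on hardware details this sketch does not capture.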
