Implementation and performance analysis of non-blocking collective operations for MPI

Lecture Notes in Computer Science

Lumsdaine

Dongarra

2009

Self Cite

Abstract. MapReduce is an emerging programming paradigm for data-parallel applications. We discuss common strategies to implement a MapReduce runtime and propose an optimized implementation on top of MPI. Our implementation combines redistribution and reduce and moves them into the network. This approach especially benefits applications with a limited number of output keys in the map phase. We also show how anticipated MPI-2.2 and MPI-3 features, such as MPI_Reduce_local and nonblocking collective operations, can be used to implement and optimize MapReduce with a performance improvement of up to 25% on 127 cluster nodes. Finally, we discuss additional features that would enable MPI to more efficiently support all MapReduce applications.

Section: Further Optimization Possibilities (mentioning)

confidence: 99%
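The idea in the MapReduce abstract above, reducing values with equal keys before they are redistributed, can be illustrated with a minimal single-process sketch. This is not the authors' implementation and it does not use MPI: the ranks are simulated as list entries, `local_reduce` only plays the role that a local reduction such as MPI_Reduce_local would play, and all function names and the input data are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_phase(chunk):
    """Map: emit one (key, 1) pair per word in this rank's chunk."""
    return [(word, 1) for word in chunk.split()]


def local_reduce(pairs):
    """Combine pairs with equal keys on the rank that produced them."""
    combined = Counter()
    for key, value in pairs:
        combined[key] += value
    return combined


def global_reduce(per_rank):
    """Stand-in for the collective reduction: merge the per-rank partial results."""
    return reduce(lambda a, b: a + b, per_rank, Counter())


# Hypothetical input: one text chunk per simulated rank.
chunks = ["a b a c", "b b a", "c a"]

partials = [local_reduce(map_phase(c)) for c in chunks]
result = global_reduce(partials)

# Compare how many pairs the map phase emits with how many leave each rank.
pairs_emitted = sum(len(map_phase(c)) for c in chunks)
pairs_communicated = sum(len(p) for p in partials)
```

With few distinct keys, `pairs_communicated` stays small no matter how many pairs the map phase emits, which is the case the abstract describes as benefiting most.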

Towards Efficient MapReduce Using MPI

Lecture Notes in Computer Science

Lumsdaine

Dongarra

2009

Self Cite

“…In order to optimize the parallel algorithm, we reduce the overhead arising from the allreduce step by overlapping its communication with computations that are independent of the communicated data. We use LibNBC's [6] non-blocking version of MPI_Allreduce called NBC_Iallreduce, and the MPI_Wait counterpart NBC_Wait.…”

Section: Algorithm Parallelization Concept (mentioning)

confidence: 99%
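The start, compute, wait pattern described in the excerpt above can be sketched without MPI. The example below is an analogy only: a background thread stands in for the network, `pool.submit` plays the role of a non-blocking start call such as NBC_Iallreduce, and `request.result()` plays the role of NBC_Wait. The functions `allreduce_sum` and `independent_work` and the input data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def allreduce_sum(contributions):
    """Stand-in for the reduction: element-wise sum of all ranks' vectors."""
    return [sum(column) for column in zip(*contributions)]


def independent_work(n):
    """Computation that does not depend on the reduction result."""
    return sum(i * i for i in range(n))


# Hypothetical per-rank contributions to the reduction.
contributions = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]

with ThreadPoolExecutor(max_workers=1) as pool:
    request = pool.submit(allreduce_sum, contributions)  # start the operation
    local = independent_work(1000)                        # overlapped computation
    reduced = request.result()                            # block until it completes

# Only code that needs the reduced data has to come after the wait.
total = local + sum(reduced)
```

The overlap is only legal because `independent_work` never reads `reduced`; anything that depends on the communicated data must stay behind the wait.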

“…LibNBC's allreduce uses multiple communication rounds (cf. [6]). This requires the user to ensure progress manually by calling NBC_Test or run a separate thread that manages the progression of LibNBC (i.e., progress thread).…”

Section: Implementation With LibNBC (mentioning)

confidence: 99%
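Why manual progression matters can be shown with a toy model. The sketch below assumes an operation made of several rounds that advances only when the caller asks for it; `SimulatedRequest` and `compute_with_progress` are hypothetical names and do not reproduce LibNBC's API or its scheduling.

```python
class SimulatedRequest:
    """Toy multi-round operation that advances only in test() or wait()."""

    def __init__(self, rounds):
        self.remaining = rounds

    def test(self):
        """Advance at most one round; return True once all rounds are done."""
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0

    def wait(self):
        """Finish all remaining rounds; return how many were left to do here."""
        leftover = self.remaining
        self.remaining = 0
        return leftover


def compute_with_progress(chunks, rounds, test_every):
    """Process chunks, calling test() every `test_every` chunks (0 means never)."""
    request = SimulatedRequest(rounds)
    acc = 0
    for i, chunk in enumerate(chunks, start=1):
        acc += sum(chunk)
        if test_every and i % test_every == 0:
            request.test()
    return acc, request.wait()


chunks = [[1, 2], [3, 4], [5, 6], [7, 8]]
with_tests = compute_with_progress(chunks, rounds=3, test_every=1)
without_tests = compute_with_progress(chunks, rounds=3, test_every=0)
```

In this model, interleaving test calls leaves nothing for the final wait, while never testing defers every round to the wait, so no communication is overlapped at all.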

“…In our case study, the scalability for the fixed problem size (strong scaling) is limited by a collective data reduction operation in which the message size is independent of the number of MPI processes (in our example 48 MiB). To reduce the communication overhead, we transform our code to leverage non-blocking collective operations offered by LibNBC [6], which provide, additionally to the overlapping of communication with computation, high-level communication offload using the InfiniBand network. We analyze the code transformations and provide an analytical runtime model that identifies the overlap potential of our approach.…”

Section: Introduction (mentioning)

confidence: 99%
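The excerpt above refers to an analytical runtime model without stating it. As a rough illustration of what such a model can express, the sketch below uses a generic textbook-style overlap model, not the one from the cited paper. It assumes the computation splits into a part that is independent of the reduction (`t_indep`) and a part that depends on it (`t_dep`), and that issuing and completing the non-blocking call costs a fixed CPU overhead (`t_overhead`). All names and numbers are hypothetical.

```python
def blocking_time(t_indep, t_dep, t_comm):
    """Blocking collective: computation and communication run back to back."""
    return t_indep + t_dep + t_comm


def overlapped_time(t_indep, t_dep, t_comm, t_overhead=0.0):
    """Non-blocking collective: independent work hides communication, or vice versa."""
    return max(t_indep, t_comm) + t_overhead + t_dep


def overlap_gain(t_indep, t_dep, t_comm, t_overhead=0.0):
    """Time saved by overlapping; negative if the overhead outweighs the benefit."""
    return blocking_time(t_indep, t_dep, t_comm) - overlapped_time(
        t_indep, t_dep, t_comm, t_overhead
    )


# Hypothetical timings in seconds.
compute_bound = overlap_gain(t_indep=3.0, t_dep=1.0, t_comm=2.0, t_overhead=0.5)
comm_bound = overlap_gain(t_indep=1.0, t_dep=1.0, t_comm=4.0, t_overhead=0.5)
```

Under these assumptions the gain is `min(t_indep, t_comm) - t_overhead`, so the overlap potential is bounded by the smaller of the independent computation and the communication time.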

Communication Optimization for Medical Image Reconstruction Algorithms

Recent Advances in Parallel Virtual Machine and Message Passing Interface

Schellmann

Gorlatch

et al. 2008

Self Cite

Abstract. This paper presents experiences and results obtained in optimizing the parallel communication performance of a production-quality medical image reconstruction application. The fundamental communication operations in the application's principal algorithm are collective reductions. The overhead of these operations was reduced by transforming the algorithm to overlap its computation and communication. Several different approaches to communication progress were studied, both user-directed and asynchronous. Experimental results comparing the new approach to the previous implementation show overall application performance improvements of up to 8%, when run on 32 nodes.

“…This distinction between CPU overhead and network parameters enables researchers to model overlap of communication and computation efficiently. We use this ability to assess the overlap potential of different network interconnect architectures and to optimize the implementation of our non-blocking collective operations library LibNBC [12]. The models of the LogP family have been used by different research groups to derive new algorithms for parallel computing, predict the performance of existing algorithms, or prove an algorithm's optimality [13][14][15][16][17][18].…”

Section: Introduction (mentioning)

confidence: 99%
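The separation of CPU overhead from network parameters mentioned in the excerpt above can be made concrete with the standard LogGP cost of a single message. In LogGP, sending k bytes takes `o + (k - 1) * G + L + o`: `o` is the CPU overhead on each side, `G` the gap per byte, and `L` the network latency. The sketch below is a minimal illustration with hypothetical parameter values, not measurements from any cited work.

```python
def loggp_message_time(k, L, o, G):
    """End-to-end LogGP time for one k-byte message: o + (k - 1) * G + L + o."""
    return o + (k - 1) * G + L + o


def cpu_busy_time(o):
    """CPU time the model charges to the message: one overhead per endpoint."""
    return 2 * o


def overlappable_time(k, L, G):
    """Part of the transfer during which the model leaves the CPUs free."""
    return L + (k - 1) * G


# Hypothetical parameters, in microseconds (G is per byte).
k, L, o, G = 9, 4.0, 1.0, 0.25

total = loggp_message_time(k, L, o, G)
busy = cpu_busy_time(o)
free = overlappable_time(k, L, G)
```

Because `total` is the sum of `busy` and `free`, the model states directly how much of a transfer could in principle be hidden behind computation.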

LogGP in theory and practice – An in-depth analysis of modern interconnection networks and benchmarking methods for collective operations

Simulation Modelling Practice and Theory

Schneider

Lumsdaine

2009

Accurate measurement and modeling of network performance is important for predicting and optimizing the running time of high-performance computing applications. Although the LogP family of models has proven to be a valuable tool for assessing the communication performance of parallel architectures, non-intrusive LogP parameter assessment of real systems remains a difficult task. Based on an analysis of accuracy and contention properties of existing measurement methods, we develop a new low-overhead measurement method which also assesses protocol changes in the underlying transport layers. We use the gathered parameters to simulate LogGP models of collective operations and demonstrate the errors in common benchmarking methods for collective operations. The simulations provide new insight into the nature of collective algorithms and their pipelining properties. We show that the error grows linearly with the system size.