In this paper, we present a comprehensive performance comparison of MPI implementations over InfiniBand, Myrinet, and Quadrics. Our performance evaluation consists of two major parts. The first is a set of MPI-level micro-benchmarks that characterize different aspects of each MPI implementation. The second consists of application-level benchmarks: the NAS Parallel Benchmarks and the Sweep3D benchmark. We not only present the overall performance results but also relate each application's communication characteristics to the information acquired from the micro-benchmarks. Our results show that each of the three MPI implementations has its own advantages and disadvantages. For our 8-node cluster, InfiniBand offers significant performance improvements over Myrinet and Quadrics for a number of applications when using the PCI-X bus. Even with only the PCI bus, InfiniBand can still perform better if the applications are bandwidth-bound.
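As a concrete illustration (our own sketch, not drawn from the paper's benchmark suite), an MPI-level micro-benchmark of the kind described above is typically a ping-pong between two ranks that reports latency and bandwidth per message size; the iteration count and size range below are arbitrary assumptions:

```c
/* Ping-pong micro-benchmark sketch: run with `mpiexec -n 2 ./pingpong`.
 * Rank 0 sends, rank 1 echoes; half the round-trip time is the one-way
 * latency, from which bandwidth is derived. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    for (int size = 1; size <= (1 << 20); size *= 2) {
        char *buf = malloc(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = (MPI_Wtime() - t0) / (2.0 * iters); /* one-way time */
        if (rank == 0)
            printf("%8d bytes  %10.2f us  %8.2f MB/s\n",
                   size, dt * 1e6, size / dt / 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```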
The emergence of multicore processors raises the need to efficiently transfer large amounts of data between local processes. MPICH2 is a highly portable MPI implementation whose large-message communication schemes suffer from high CPU utilization and cache pollution because of a double-buffering strategy common to many MPI implementations. We introduce two strategies offering a kernel-assisted, single-copy model with support for noncontiguous and asynchronous transfers. The first uses the now widely available vmsplice Linux system call; the second further improves performance with a custom kernel module called KNEM. The latter also offers I/OAT copy offload, which is dynamically enabled depending on both hardware cache characteristics and message size. These new solutions outperform the standard transfer method in the MPICH2 implementation when no cache is shared between the processing cores or when very large messages are being transferred. Collective communication operations show a dramatic improvement, and the IS NAS parallel benchmark shows a 25% speedup and better cache efficiency.
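To make the first strategy concrete, the following is a minimal sketch (our own; the single pipe, buffer size, and process roles are illustrative assumptions, not the paper's code) of a kernel-assisted transfer with vmsplice: the sender hands its user pages to a pipe without copying them through a shared bounce buffer, and the receiver's read() then performs the single copy:

```c
/* vmsplice single-copy transfer sketch (Linux, glibc). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int pfd[2];
    if (pipe(pfd) != 0) { perror("pipe"); return 1; }

    size_t len = 1 << 16;                 /* 64 KiB message (arbitrary) */
    char *buf = malloc(len);
    memset(buf, 'x', len);

    if (fork() == 0) {                    /* receiver */
        close(pfd[1]);
        char *dst = malloc(len);
        size_t got = 0;
        while (got < len) {               /* read() is the single copy */
            ssize_t n = read(pfd[0], dst + got, len - got);
            if (n <= 0) break;
            got += (size_t)n;
        }
        printf("received %zu bytes\n", got);
        return 0;
    }

    close(pfd[0]);                        /* sender */
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    while (iov.iov_len > 0) {
        /* vmsplice maps the sender's user pages into the pipe; the pages
         * must not be modified until the receiver has consumed them */
        ssize_t n = vmsplice(pfd[1], &iov, 1, 0);
        if (n < 0) { perror("vmsplice"); return 1; }
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len -= (size_t)n;
    }
    close(pfd[1]);
    wait(NULL);
    return 0;
}
```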
This paper focuses on the transfer of large data in SMP systems. Achieving good performance for intranode communication is critical for developing an efficient communication system, especially in the context of SMP clusters. We evaluate the performance of five transfer mechanisms: shared-memory buffers, message queues, the ptrace system call, kernel module-based copy, and a high-speed network. We evaluate each mechanism based on latency, bandwidth, its impact on application cache usage, and its suitability to support MPI two-sided and one-sided messages.
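For illustration, here is a minimal sketch (ours, not the paper's code; the shared-memory name, buffer size, and sleep-based synchronization are simplifying assumptions) of the first mechanism, a shared-memory buffer with copy-in/copy-out semantics, whose two copies are the source of the cache impact the paper measures:

```c
/* Shared-memory buffer transfer sketch (POSIX; link with -lrt on
 * older glibc). The sender copies into the buffer, the receiver
 * copies out: two copies per message. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define SHM_NAME "/xfer_demo"        /* hypothetical segment name */
#define BUF_SIZE (1 << 16)           /* 64 KiB intermediate buffer */

int main(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, BUF_SIZE);
    char *shm = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    if (fork() == 0) {               /* receiver */
        sleep(1);                    /* crude synchronization, sketch only */
        char local[64];
        memcpy(local, shm, sizeof local);     /* second copy, out */
        printf("receiver got: %.5s\n", local);
        return 0;
    }

    memcpy(shm, "hello", 6);         /* sender: first copy, in */
    wait(NULL);
    shm_unlink(SHM_NAME);
    return 0;
}
```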
I. MOTIVATION AND SCOPE

Designing a communication system tailored for a particular architecture requires understanding the achievable performance levels of the underlying hardware and software. Such understanding is key to a more efficient design and better performance for interprocess communication. Interprocess communication usually falls into two main categories: communication between processes within an SMP node, and communication between processes on different nodes. Considerable research has been carried out in the latter case, where communication takes place over various high-performance networks. Communication over shared memory is a field of study that has regained popularity with the growing market of SMP clusters.

In this paper, we focus on the shared-memory case and analyze five methods of transferring data between processes on an SMP node. We compare their performance based on the usual metrics of latency and throughput. We also consider three other important factors that have generally been overlooked in the past: scalability; the effects of the data transfer operation on processor caches, specifically on application data located in the cache; and the setup time required to use each mechanism. We focus on mechanisms available on Intel Xeon-based SMP nodes; however, we believe that similar mechanisms can be used on other architectures with similar results.

The structure of this paper is as follows. In Section II, we describe the data transfer mechanisms that we considered. In Section III, we present our performance evaluation of the mechanisms with regard to the different metrics chosen. In Section IV, we discuss the suitability of the different mechanisms to support large MPI two-sided and one-sided messages.
Hybrid parallel programming with the message passing interface (MPI) for internode communication in conjunction with a shared-memory programming model to manage intranode parallelism has become a dominant approach to scalable parallel programming. While this model provides a great deal of flexibility and performance potential, it saddles programmers with the complexity of utilizing two parallel programming models.
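A minimal sketch (hypothetical, not taken from the text) of this hybrid model uses MPI at the process level and OpenMP for intranode threads; requesting a threading level at initialization is the point of contact between the two models:

```c
/* Hybrid MPI+OpenMP sketch. Compile e.g. with: mpicc -fopenmp hybrid.c */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* MPI_THREAD_FUNNELED: only the master thread makes MPI calls,
     * which keeps MPI usage safe alongside OpenMP parallel regions. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```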
This paper presents a new low-level communication subsystem called Nemesis. Nemesis has been designed and implemented to be scalable and efficient both for intranode communication using shared memory and for internode communication using high-performance networks, and it is natively multimethod-enabled. Nemesis has been integrated into MPICH2 as a CH3 channel and delivers better performance than other dedicated communication channels in MPICH2. Furthermore, the resulting MPICH2 architecture outperforms other MPI implementations in point-to-point benchmarks.
This paper presents the implementation of MPICH2 over the Nemesis communication subsystem and the evaluation of its shared-memory performance. We describe design issues as well as some of the optimization techniques we employed. We conducted a performance evaluation over shared memory using microbenchmarks. The evaluation shows that MPICH2 Nemesis has very low communication overhead, making it suitable for smaller-grained applications.