Rob Aulwes scite author profile

We discuss the unique architectural elements of the Los Alamos Message Passing Interface (LA-MPI), a high-performance, network-fault-tolerant, thread-safe MPI library. LA-MPI is designed for use on terascale clusters, which are inherently unreliable due to their sheer number of system components and tradeoffs between cost and performance. We examine in detail the design concepts used to implement LA-MPI. These include reliability features, such as application-level checksumming, message retransmission, and automatic message rerouting. Other key performance-enhancing features, such as concurrent message routing over multiple, diverse network adapters and protocols, and communication-specific optimizations (e.g., shared memory) are examined.

Network Fault Tolerance in LA-MPI

Aulwes

Daniel

Desai

et al. 2003

Abstract. LA-MPI is a high-performance, network-fault-tolerant implementation of MPI designed for terascale clusters that are inherently unreliable due to their very large number of system components and to trade-offs between cost and performance. This paper reviews the architectural design of LA-MPI, focusing on our approach to guaranteeing data integrity. We discuss our network data path abstraction that makes LA-MPI highly portable, gives high performance through message striping, and most importantly provides the basis for network fault tolerance. Finally, we include some performance numbers for the Quadrics and UDP network paths.

Monte Carlo Application ToolKit (MCATK)

Adams

Nolen

Sweezy

et al. 2014

High Performance Broadcast Support in La-Mpi Over Quadrics

Yu

Sur

Panda

et al. 2005

The International Journal of High Performance Computing Applica

LA-MPI is a unique MPI implementation that provides end-to-end reliable message passing between application processes. LA-MPI collective operations are implemented on top of its point-to-point operations, using generic spanning tree-based collective algorithms. The performance of the collective operations scales in a logarithmic order over that of the point-to-point operations. Thus, it is desirable to provide more efficient and more scalable collective operations while maintaining the end-to-end reliability. To this end, we investigate the feasibility of utilizing Quadrics hardware broadcast in this paper. We explore several challenging issues such as broadcast buffer management, broadcast over arbitrary processes, retransmission and reliability. Accordingly, a low-latency, highly scalable, fault-tolerant broadcast algorithm is designed and implemented over Quadrics hardware broadcast. Our evaluation shows that this implementation reduces broadcast latency and achieves higher scalability relative to the generic version of this operation. In addition, we observe that the performance of our implementation is comparable to that of the high performance implementation by Quadrics Supercomputers World for MPICH, and HP's for Alaska MPI, while providing fault tolerance to network errors not provided by these.

In search of numerical consistency in parallel programming

Architecture of LA-MPI, a network-fault-tolerant MPI
