Abstract. Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communication despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly; however, its scalability and practicability have to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than, UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.

Keywords: Programming systems, programming models, Remote Memory Access (RMA), Remote Direct Memory Access (RDMA), MPI-3, one-sided communication

Motivation

Network interfaces evolve rapidly to implement a growing set of features directly in hardware. A key feature of today's high-performance networks is remote direct memory access (RDMA). RDMA enables a process to directly access memory on remote processes without involvement of the operating system or activities at the remote side. This hardware support enables a powerful programming mode similar to shared memory programming.
RDMA is supported by on-chip networks in, e.g., Intel's SCC and IBM's Cell systems, as well as off-chip networks such as InfiniBand [30,37].

From a programmer's perspective, parallel programming schemes can be split into three categories: (1) shared memory with implicit communication and explicit synchronization, (2) message passing with explicit communication and implicit synchronization (as a side effect of communication), and (3) remote memory access and partitioned global address space (PGAS), where synchronization and communication are managed independently.

Architects realized early that shared memory often cannot be emulated efficiently on distributed machines [19]. Thus, message passing became the de facto standard for large-scale parallel programs [28]. However, with the advent of RDMA networks, it became clear that message passing over RDMA incurs additional overheads in comparison with native remote memory access (RMA, a.k.a. PGAS) programming [6,7,29]. This is mainly due to message matching, practical issues with overlap, and because fast message passing libraries over RDMA usually require different protocols [41]: an eager protocol with receiver-side buffering of small messages and a rendezvous protocol that synchronizes the sender. Eager requires additional copies, and rendezvous sends additional messages and may delay the sending process.

In summary, directly programming RDMA hardware has benefits in the...
Moving data between processes has often been discussed as one of the major bottlenecks in parallel computing; a large body of research strives to improve communication latency and bandwidth on different networks, measured with ping-pong benchmarks of different message sizes. In practice, the data to be communicated generally originates from application data structures and needs to be serialized before it is communicated over serial network channels. This serialization is often done by explicitly copying the data to communication buffers. The Message Passing Interface (MPI) standard defines derived datatypes to allow zero-copy formulations of non-contiguous data access patterns. However, many applications still choose to implement manual pack/unpack loops, partly because they are more efficient than some MPI implementations. MPI implementers, on the other hand, do not have good benchmarks that represent important application access patterns. We demonstrate that data serialization can consume up to 80% of the total communication overhead for important applications. This indicates that most of the current research on optimizing serial network transfer times may be targeted at the smaller fraction of the communication overhead. To support the scientific community, we extracted the send/recv-buffer access patterns of a representative set of scientific applications to build a benchmark that includes serialization and communication of application data and thus reflects all communication overheads. This can be used like traditional ping-pong benchmarks to determine the holistic communication latency and bandwidth
Abstract. Data is often communicated from different locations in application memory and is commonly serialized (copied) to send buffers or from receive buffers. MPI datatypes are a way to avoid such intermediate copies and optimize communication; however, it is often unclear which implementation and optimization choices are most useful in practice. We extracted the send/recv-buffer access patterns of a representative set of scientific applications into micro-applications that isolate their data access patterns. We also observed that the buffer-access patterns in applications can be categorized into three different groups. Our micro-applications show that up to 90% of the total communication time can be spent on local serialization, and we found significant performance discrepancies between state-of-the-art MPI implementations. Our micro-applications aim to provide a standard benchmark for MPI datatype implementations to guide optimizations, much as SPEC CPU and the Livermore loops do for compiler optimizations.
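To make the serialization step concrete, here is a minimal Python sketch of a manual pack/unpack loop for a non-contiguous access pattern. It is not taken from the paper or from any MPI implementation. The function names `pack_strided` and `unpack_strided` and the 4x4 matrix are hypothetical, and the `count`, `blocklength`, and `stride` parameters are merely chosen to mirror the way `MPI_Type_vector` describes the same layout.

```python
from array import array


def pack_strided(src, offset, count, blocklength, stride):
    """Copy `count` blocks of `blocklength` elements, spaced `stride`
    elements apart and starting at `offset`, into a contiguous buffer.
    This is the intermediate copy that a derived datatype aims to avoid."""
    packed = array(src.typecode)
    for i in range(count):
        start = offset + i * stride
        packed.extend(src[start:start + blocklength])
    return packed


def unpack_strided(packed, dst, offset, count, blocklength, stride):
    """Inverse of pack_strided: scatter a contiguous buffer back into
    the strided locations of `dst` (modified in place)."""
    for i in range(count):
        start = offset + i * stride
        dst[start:start + blocklength] = packed[i * blocklength:(i + 1) * blocklength]


# Hypothetical example: a 4x4 row-major matrix holding 0.0 .. 15.0.
rows, cols = 4, 4
matrix = array("d", range(rows * cols))

# Column 1 is non-contiguous in memory: 4 blocks of 1 element, 4 apart.
send_buffer = pack_strided(matrix, offset=1, count=rows, blocklength=1, stride=cols)

# On the receiving side, the contiguous buffer is scattered back.
received = array("d", [0.0] * (rows * cols))
unpack_strided(send_buffer, received, offset=1, count=rows, blocklength=1, stride=cols)
```

Both loops touch every element once in addition to the transfer itself, which is the local serialization cost the micro-applications are meant to expose.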