The Los Alamos Message Passing Interface (LA-MPI) is an end-to-end networkfailure-tolerant message-passing system designed for terascale clusters. LA-MPI is a standard-compliant implementation of MPI designed to tolerate networkrelated failures including I/O bus errors, network card errors, and wire-transmission errors. This paper details the distinguishing features of LA-MPI, including support for concurrent use of multiple types of network interface, and reliable message transmission utilizing multiple network paths and routes between a given source and destination. In addition, performance measurements on production-grade platforms are presented.
AbstracI-Microhenchmarks, i.e. very small computational kernels, have become commonly used for quantitative measures of node performance in clusters. For example, a commonly used benchmark measures the amount of time required to perform a fixed quantum of work. Unfortunately, this benchmark is one of many that vinlate well known rules from sampling theory, leading to erroneous, contradictory or misleading results. At a minimum, these types of henchmdrks can not he used to identify time-based activitia that may interfere with and hence limit application performance. Our original and primary goal remains to identify noise in the system due to periodic activities that are not part of user application code. In this paper, we discuss why the 'fixed quantum of work' benchmark provides data that is of limited use for analysis; and we show code for, discuss, and analyze results frnm a microbenchmark which follows good rules of sampling hygiene, and hence provides useful data for analysis.
There are many APIs for connecting and exchanging data between network peers. Each interface varies wildly based on metrics including performance, portability, and complexity. Specifically, many interfaces make design or implementation choices emphasizing some of the more desirable metrics (e.g., performance) while sacrificing others (e.g., portability). As a direct result, software developers building large, network-based applications are forced to choose a specific network API based on a complex, multi-dimensional set of criteria. Such trade-offs inevitably result in an interface that fails to deliver some desirable features.In this paper, we introduce a novel interface that both supports many features that have become standard (or otherwise generally expected) in other communication interfaces, and strives to export a small, yet powerful, interface. This new interface draws upon years of experience from network-oriented software development best practices to systems-level implementations. The goal is to create a relatively simple, high-level communication interface with low barriers to adoption while still providing important features such as scalability, resiliency, and performance. The result is the Common Communications Interface (CCI): an intuitive API that is portable, efficient, scalable, and robust to meet the needs of network-intensive applications common in HPC and cloud computing.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.