Over the last twenty years, the open source community has provided more and more software on which the world's High Performance Computing (HPC) systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. But although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and GPUs. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.
In this paper we consider an optimization problem that arises in the execution of parallel programs on shared-memory multiple-instruction-stream, multiple-data-stream (MIMD) computers. A program on such machines consists of many sequential program segments, each executed by a single processor. These segments interact as they access shared variables. Access to memory is asynchronous, and memory accesses are not necessarily executed in the order they were issued. An execution is correct if it is sequentially consistent: it should seem as if all the instructions were executed sequentially, in an order obtained by interleaving the instruction streams of the processors. Sequential consistency can be enforced by delaying each access to shared memory until the previous access of the same processor has terminated. For performance reasons, however, we want to allow several accesses by the same processor to proceed concurrently. Our analysis finds a minimal set of delays that enforces sequential consistency. The analysis extends to interprocessor synchronization constraints and to code where blocks of operations have to execute atomically. We use a conflict graph similar to that used to schedule transactions in distributed databases. Our graph incorporates the order on operations given by the program text, enabling us to do without locks even when database conflict graphs would suggest that locks are necessary. Our work has implications for the design of multiprocessors; it offers new compiler optimization techniques for parallel languages that support shared variables.
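The delay-set idea can be made concrete on the classic two-processor flag example. The Python sketch below is purely illustrative (the operation encoding, the toy mixed-cycle check, and all names are our own assumptions, not the paper's algorithm): it builds program-order and conflict edges and flags the program edges that lie on a cycle mixing both kinds, which are the accesses that must not be allowed to proceed concurrently.

```python
# Hypothetical sketch of the delay-set idea on the two-processor flag example.
# Each operation: (processor, index_in_program, kind, variable)
ops = [
    ("P1", 0, "write", "x"), ("P1", 1, "read", "y"),
    ("P2", 0, "write", "y"), ("P2", 1, "read", "x"),
]

# Program-order edges (within a processor) and conflict edges (two accesses to
# the same variable on different processors, at least one of them a write).
program_edges = [(a, b) for a in ops for b in ops
                 if a[0] == b[0] and a[1] < b[1]]
conflict_edges = [(a, b) for a in ops for b in ops
                  if a[0] != b[0] and a[3] == b[3]
                  and "write" in (a[2], b[2])]

def on_mixed_cycle(p_edge):
    """True if the program edge a->b closes a cycle b ->conflict-> c ->program-> d ->conflict-> a."""
    a, b = p_edge
    for (c1, c2) in conflict_edges:
        if c1 != b:
            continue
        for (d1, d2) in program_edges:
            if d1 == c2 and (d2, a) in conflict_edges:
                return True
    return False

delays = [e for e in program_edges if on_mixed_cycle(e)]
for (a, b) in delays:
    print(f"{a[0]}: must complete {a[2]}({a[3]}) before starting {b[2]}({b[3]})")
```

Run on this example, both processors' write-then-read pairs are reported, i.e., each write must complete before the following read starts, which is exactly the ordering sequential consistency demands here.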
Over the past few years resilience has become a major issue for high-performance computing (HPC) systems, in particular in the perspective of large petascale systems and future exascale systems. These systems will typically gather from half a million to several million central processing unit (CPU) cores running up to a billion threads. From the current knowledge and observations of existing large systems, it is anticipated that exascale systems will experience various kinds of faults many times per day. It is also anticipated that the current approach for resilience, which relies on automatic or application-level checkpoint/restart, will not work because the time for checkpointing and restarting will exceed the mean time to failure of a full system. This set of projections leaves the community of fault tolerance for HPC systems with a difficult challenge: finding new approaches, possibly radically disruptive ones, to run applications until their normal termination despite the essentially unstable nature of exascale systems. Yet, the community has only five to six years to solve the problem. This white paper synthesizes the motivations, observations, and research issues considered determinant by several complementary experts in HPC applications, programming models, distributed systems, and system management.
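The claim that checkpoint/restart stops working when checkpoint time approaches the mean time between failures can be illustrated with a first-order model. The sketch below is an assumption-laden back-of-the-envelope calculation (Young's square-root rule for the optimal checkpoint interval, an invented one-hour system MTBF, and a crude waste estimate), not an analysis from the paper:

```python
# First-order model: with checkpoint cost C and system MTBF M, checkpointing
# every tau seconds wastes roughly C/tau (checkpoint overhead) plus tau/(2*M)
# (expected recomputation after a failure).
import math

def waste_fraction(C, M):
    tau = math.sqrt(2 * C * M)          # Young's approximation of the optimal interval
    return C / tau + tau / (2 * M)

M = 3600.0  # assumed full-system MTBF of one hour (illustrative exascale scenario)
for C in (60.0, 600.0, 1800.0, 3600.0):  # checkpoint cost in seconds
    w = waste_fraction(C, M)
    print(f"checkpoint cost {C:6.0f}s -> ~{min(w, 1.0) * 100:5.1f}% of time lost")
```

Under these assumptions, a one-minute checkpoint already loses roughly a fifth of the machine, and a thirty-minute checkpoint leaves essentially no time for useful work, which is the breakdown the abstract warns about.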
The steadily increasing number of nodes in high-performance computing systems and the technology and power constraints lead to sparse network topologies. Efficient mapping of application communication patterns to the network topology gains importance as systems grow to petascale and beyond. Such mapping is supported in parallel programming frameworks such as MPI, but is often not well implemented. We show that the topology mapping problem is NP-complete and analyze and compare different practical topology mapping heuristics. We demonstrate an efficient and fast new heuristic which is based on graph similarity and show its utility with application communication patterns on real topologies. Our mapping strategies support heterogeneous networks and show significant reduction of congestion on torus, fat-tree, and the PERCS network topologies for irregular communication patterns. We also demonstrate that the benefit of topology mapping grows with the network size and show how our algorithms can be used in a practical setting to optimize communication performance. Our efficient topology mapping strategies are shown to reduce network congestion by up to 80%, reduce average dilation by up to 50%, and improve benchmarked communication performance by 18%.

MOTIVATION. The number of nodes in the largest computing systems, and, hence, the size of their interconnection networks, is increasing rapidly: the Jaguar system at ORNL has over 18,000 nodes, and larger systems are expected in the near future. These networks are built by interconnecting nodes (switches and processors) with links. Pin count, power, and gate count constraints restrict the number of links per switch; typical sizes are 24 (InfiniBand), 36 (Myrinet, InfiniBand), or 6 (SeaStar or BlueGene/P). Different topologies are used to construct large-scale networks from crossbars, e.g., k-ary n-cubes (hypercube, torus), k-ary n-trees (fat-trees), or folded Clos networks. Networks also differ in their routing protocols. As the number of nodes grows larger, the diameter of the network (i.e., the maximum distance between two processors) increases; for many topologies, the bisection bandwidth (i.e., the minimum total bandwidth of links that need to be cut in order to divide the processors into two equal sets) decreases relative to the number of nodes. This effect is well understood, and it is generally accepted that dense communication patterns (such as an all-to-all communication where each node communicates with every other node) are hard to scale beyond petascale systems. Luckily, the communication patterns of many applications are relatively sparse (each node communicates with a few others), and dense communications can be replaced by repeated sparse communications (e.g., the all-to-all communication used for the transpose in a parallel Fast Fourier Transform can be replaced by two phases of group transposes, each involving only Θ(√P) processors [17]). Furthermore, the communication pattern often has significant locality, e.g., when most communication occurs between adjacent cell...
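To give a feel for the dilation metric mentioned above, the following Python sketch compares two placements by traffic-weighted hop distance on a small torus. It is a toy model under our own assumptions (a 4x4 torus, a 1D nearest-neighbor traffic pattern, and hand-picked mappings); it is not the paper's graph-similarity heuristic.

```python
# Toy dilation comparison: traffic-weighted hop distance of a mapping on a 2D torus.
from collections import deque

def torus_neighbors(node, k):
    """The 4 neighbors of (x, y) on a k x k 2D torus."""
    x, y = node
    return [((x + 1) % k, y), ((x - 1) % k, y), (x, (y + 1) % k), (x, (y - 1) % k)]

def hop_distance(src, dst, k):
    """BFS shortest-path hop count on the torus."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in torus_neighbors(node, k):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))

def weighted_dilation(traffic, mapping, k):
    """Sum over communicating pairs of (traffic volume * hops) under the mapping."""
    return sum(vol * hop_distance(mapping[a], mapping[b], k)
               for (a, b), vol in traffic.items())

k = 4  # 4x4 torus, 16 processes
# Assumed 1D nearest-neighbor application pattern: rank i talks to rank i+1.
traffic = {(i, i + 1): 1.0 for i in range(15)}
# Row-major ("by rank") placement vs. a snake-like placement that keeps
# consecutive ranks adjacent on the torus.
row_major = {i: (i % k, i // k) for i in range(16)}
snake = {i: (i % k if (i // k) % 2 == 0 else k - 1 - i % k, i // k) for i in range(16)}
print("row-major dilation:", weighted_dilation(traffic, row_major, k))
print("snake dilation:    ", weighted_dilation(traffic, snake, k))
```

The snake placement keeps logically adjacent ranks physically adjacent and therefore yields a lower dilation than the row-major placement; achieving the same effect for large, irregular patterns is what the paper's mapping heuristics pursue.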
The problem of computing polynomials in certain semirings is considered. Precise bounds are obtained on the number of multiplications required by straight-line algorithms which compute such functions as iterated matrix multiplication, iterated convolution, and permanent. Using these bounds, it is shown that the use of branching can exponentially speed up computations using the min, + operations, and that subtraction can exponentially speed up arithmetic computations. These results can be interpreted as denying the existence of fast "universal" algorithms for computing certain polynomials. Key words and phrases: arithmetic complexity, convexity theory, Farkas lemma, minimax algebra, straight-line algorithm.
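As a small illustration of the (min, +) semiring referenced in this abstract (our own example, not from the paper): matrix "multiplication" in this semiring replaces products by sums and sums by minima, so iterating it on an edge-weight matrix computes shortest-path lengths.

```python
# Illustrative (min, +) semiring example: the matrix "product" uses + where
# ordinary algebra uses *, and min where it uses +.
INF = float("inf")

def min_plus_mul(A, B):
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Edge-weight matrix of a small directed graph (INF = no edge, 0 on the diagonal).
W = [[0,   3,   INF],
     [INF, 0,   1],
     [7,   INF, 0]]

# Iterating the product n-1 times yields all-pairs shortest-path lengths.
D = W
for _ in range(len(W) - 1):
    D = min_plus_mul(D, W)
print(D)  # D[0][2] == 4, via the path 0 -> 1 -> 2
```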