2010
DOI: 10.1007/978-3-642-15646-5_2

Enabling Concurrent Multithreaded MPI Communication on Multicore Petascale Systems

Abstract: With the ever-increasing numbers of cores per node on HPC systems, applications are increasingly using threads to exploit the shared memory within a node, combined with MPI across nodes. Achieving high performance when a large number of concurrent threads make MPI calls is a challenging task for an MPI implementation. We describe the design and implementation of our solution in MPICH2 to achieve high-performance multithreaded communication on the IBM Blue Gene/P. We use a combination of a multichannel-…
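The scenario the abstract describes, many threads inside one MPI process issuing MPI calls concurrently, corresponds to initializing MPI at the MPI_THREAD_MULTIPLE level. Below is a minimal hybrid MPI+pthreads sketch of that pattern; NUM_THREADS, worker, and the tag-per-thread matching scheme are illustrative choices, not details from the paper.

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Each thread communicates under its own tag, so concurrent calls
 * from different threads match independently on the peer rank. */
static void *worker(void *arg)
{
    int tid = (int)(size_t)arg;
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* ring neighbors */
    int left  = (rank - 1 + size) % size;
    int out = rank * NUM_THREADS + tid, in = -1;

    MPI_Sendrecv(&out, 1, MPI_INT, right, tid,
                 &in,  1, MPI_INT, left,  tid,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d thread %d received %d\n", rank, tid, in);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    /* Ask for full thread support; the library reports what it grants. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);

    pthread_t t[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)(size_t)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);

    MPI_Finalize();
    return 0;
}

When every thread funnels through one implementation-wide lock, the concurrent calls above serialize; the paper's contribution is making them proceed in parallel.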

Cited by 24 publications (19 citation statements). References 5 publications.
“…We continued to work on the development of lightweight thread support for MPICH. Work early in the project was performed in collaboration with both Argonne and the IBM Blue Gene team [10]. Work on fine-grain multithreading support showed how to avoid excessive lock overhead in an MPI implementation [3,2].…”
Section: Some Of the Most Interesting Results From This Project Addre…
confidence: 99%
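The lock-overhead result this quote refers to hinges on replacing one global critical section with finer-grained ones. A hedged illustration of the granularity difference, using a hypothetical queue structure rather than MPICH internals:

#include <pthread.h>
#include <stdlib.h>

/* Coarse grain: every queue shares one global lock, so unrelated
 * operations serialize.  Fine grain: each queue carries its own
 * lock, so threads touching different queues proceed in parallel. */
typedef struct node { struct node *next; void *payload; } node_t;

typedef struct {
    node_t          *head;
    pthread_mutex_t  lock;   /* per-queue lock (fine grain) */
} queue_t;

void queue_init(queue_t *q)
{
    q->head = NULL;
    pthread_mutex_init(&q->lock, NULL);
}

void queue_push(queue_t *q, node_t *n)
{
    pthread_mutex_lock(&q->lock);    /* contends only on this queue */
    n->next = q->head;
    q->head = n;
    pthread_mutex_unlock(&q->lock);
}

node_t *queue_pop(queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    node_t *n = q->head;
    if (n) q->head = n->next;
    pthread_mutex_unlock(&q->lock);
    return n;
}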
“…This feature can be used in a mixed programming model, like the one explored by researchers in [22], where UPC and MPI were used to scale a memory-bound application. In [16], the authors use parallel communication channels to speed up the MPI message rate. PAMI extends and generalizes this notion of communication parallelism using PAMI Contexts and uses a new message handoff technique to accelerate the message rate.…”
Section: A. Related Work
confidence: 99%
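PAMI Contexts are IBM-specific, but the channel-parallelism idea they generalize, giving each thread an independent communication path so traffic does not funnel through one shared state, can be sketched portably. The sketch below uses one duplicated MPI communicator per thread as a stand-in channel; it is an analogy, not the PAMI API, and NUM_CHANNELS is an illustrative name.

#include <mpi.h>
#include <pthread.h>

#define NUM_CHANNELS 4

/* One duplicated communicator per thread: message matching is scoped
 * per communicator, so each thread's traffic flows on its own channel. */
static MPI_Comm channels[NUM_CHANNELS];

static void *channel_worker(void *arg)
{
    int ch = (int)(size_t)arg;
    int rank, size;
    MPI_Comm_rank(channels[ch], &rank);
    MPI_Comm_size(channels[ch], &size);

    int out = rank, in = -1;
    /* The same tag everywhere is safe: the communicators keep channels apart. */
    MPI_Sendrecv(&out, 1, MPI_INT, (rank + 1) % size, 0,
                 &in,  1, MPI_INT, (rank - 1 + size) % size, 0,
                 channels[ch], MPI_STATUS_IGNORE);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);

    for (int i = 0; i < NUM_CHANNELS; i++)
        MPI_Comm_dup(MPI_COMM_WORLD, &channels[i]);

    pthread_t t[NUM_CHANNELS];
    for (int i = 0; i < NUM_CHANNELS; i++)
        pthread_create(&t[i], NULL, channel_worker, (void *)(size_t)i);
    for (int i = 0; i < NUM_CHANNELS; i++)
        pthread_join(t[i], NULL);

    for (int i = 0; i < NUM_CHANNELS; i++)
        MPI_Comm_free(&channels[i]);
    MPI_Finalize();
    return 0;
}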
“…Such an implementation is thread safe, but has limited scalability due to the global lock. We explored fine-grained locking and lockless techniques in MPICH2 [13,16]. We extended request allocators by creating thread-private pools to minimize locking overheads.…”
Section: A Multi-Threaded MPI Over PAMI
confidence: 99%
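A thread-private request pool, as described in this quote, removes the allocator lock from the fast path because each thread only ever touches its own free list. A hypothetical sketch follows; request_t and the pool layout are illustrative, not MPICH2's actual structures, and __thread is GCC/Clang thread-local storage.

#include <stdlib.h>

typedef struct request {
    struct request *next;
    /* ... request fields ... */
} request_t;

/* One free list per thread: no lock is taken on the fast path,
 * because no other thread ever reads or writes this list. */
static __thread request_t *free_list = NULL;

request_t *request_alloc(void)
{
    request_t *r = free_list;
    if (r) {                      /* fast path: pop from the private pool */
        free_list = r->next;
        return r;
    }
    return malloc(sizeof(request_t));  /* slow path: fall back to the heap */
}

void request_free(request_t *r)
{
    r->next = free_list;          /* return to this thread's pool */
    free_list = r;
}

A request freed by a different thread than the one that allocated it simply migrates between pools, so no cross-thread synchronization is ever needed in this sketch.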
“…While there is little related work on the endpoints as introduced in [10], a large body of work exists on the exploitation of shared memory nodes within MPI or other parallel programming languages like UPC [7], [9], [13]. Many of these papers focus on optimizing various communication primitives by means of a shared memory region.…”
Section: Related Work
confidence: 99%
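The shared-memory-region optimizations surveyed here are most directly expressed today through MPI-3 shared-memory windows, which let node-local ranks read each other's memory with plain loads and stores. A minimal sketch, included only to make the mechanism concrete (the MPI-3 interface post-dates some of the cited work):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group the ranks that share a physical node. */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);

    int nrank, nsize;
    MPI_Comm_rank(node, &nrank);
    MPI_Comm_size(node, &nsize);

    /* Allocate one int per node-local rank inside one shared segment. */
    int *mine;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                            node, &mine, &win);
    *mine = nrank * nrank;

    MPI_Win_fence(0, win);   /* make all local writes visible */

    /* Rank 0 reads rank 1's slot directly, with no message at all. */
    if (nrank == 0 && nsize > 1) {
        MPI_Aint size;
        int disp;
        int *peer;
        MPI_Win_shared_query(win, 1, &size, &disp, &peer);
        printf("neighbor wrote %d\n", *peer);
    }

    MPI_Win_free(&win);
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}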