2005
DOI: 10.1147/rd.492.0393

Design and implementation of message-passing services for the Blue Gene/L supercomputer

Abstract: The Blue Gene/L (BG/L) supercomputer, with 65,536 dual-processor compute nodes, was designed from the ground up to support efficient execution of massively parallel message-passing programs. Part of this support is an optimized implementation of the Message Passing Interface (MPI), which leverages the hardware features of BG/L. MPI for BG/L is implemented on top of a more basic message-passing infrastructure called the message layer. This message layer can be used both to implement other higher-level libraries…
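
To make the layering concrete, here is a minimal C sketch of how an MPI-style blocking send might delegate to such a message layer. Every msgl_* name and the callback scheme are hypothetical illustrations, not the actual BG/L message-layer API; the stubs complete immediately so the sketch compiles standalone.

    #include <stddef.h>

    typedef void (*msgl_cb_t)(void *arg);   /* completion-callback type */

    /* Stub message-layer primitives; on BG/L these would drive the torus
       and tree network hardware.  Here they complete instantly. */
    static int msgl_send(int dest, const void *buf, size_t len,
                         msgl_cb_t done, void *arg)
    {
        (void)dest; (void)buf; (void)len;
        done(arg);                          /* pretend the send finished */
        return 0;
    }
    static void msgl_advance(void) { /* would poll network FIFOs */ }

    static void mark_done(void *flag) { *(int *)flag = 1; }

    /* An MPI_Send-like blocking call built entirely on the message layer. */
    int sketch_send(const void *buf, size_t len, int dest)
    {
        int done = 0;
        msgl_send(dest, buf, len, mark_done, &done);
        while (!done)                       /* spin until the callback fires */
            msgl_advance();
        return 0;
    }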

Cited by 46 publications (40 citation statements) | References 18 publications

“…This scaling plot shows that use of the BG/L ADE SPI communications interfaces allows continued performance gains to values of atoms per node well below those achievable using MPI. The MPI implementation on Blue Gene/L [20] is quite good as evidenced by the results achieved on the 3D-FFT [22], but the scalability of Blue Matter using MPI appears to be limited by the performance of the neighborhood broadcast and reduce collectives discussed above, as can be seen in Table 2.…”
Section: Performance Results
confidence: 96%
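
The "neighborhood" collectives the quote blames for the scaling limit are broadcasts and reductions restricted to a small group of nearby ranks. Below is a minimal sketch of one way to express such an operation in standard MPI, using MPI_Comm_split to carve out a sub-communicator; the fixed grouping of 8 consecutive ranks is an illustrative assumption, not Blue Matter's actual decomposition.

    #include <mpi.h>

    /* Sum a value across an 8-rank "neighborhood" by splitting
       MPI_COMM_WORLD into sub-communicators and reducing within one. */
    double neighborhood_sum(double local)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Comm hood;
        MPI_Comm_split(MPI_COMM_WORLD, rank / 8, rank, &hood);

        double total = 0.0;
        MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, hood);

        MPI_Comm_free(&hood);
        return total;
    }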
“…We have implemented the second and third options and have found that as a result of optimizations of the MPI collectives for BG/L [20], the third option gives superior performance. Even so, the realized performance on MPI does not yet reflect the full capabilities of the hardware.…”
Section: Parallelization Strategies and Challenges
confidence: 99%
“…The Pthreads library is used to spawn multiple UPC threads on systems with SMP nodes. Implemented messaging methods include TCP/IP sockets, LAPI [23], Myrinet/GM transport [19] and the BlueGene/L messaging framework [1].…”
Section: The IBM XLUPC Runtime
confidence: 99%
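
A minimal sketch of the SMP-node strategy this quote describes: one Pthread spawned per UPC thread on the node. run_upc_thread and NTHREADS are illustrative stand-ins, not XLUPC runtime symbols.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4   /* assumed UPC threads per SMP node */

    static void *run_upc_thread(void *arg)
    {
        long id = (long)arg;
        printf("UPC thread %ld running\n", id);  /* a real runtime would enter user code */
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tids[i], NULL, run_upc_thread, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }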
“…The current implementation of MPI on BG/L [17] is based on MPICH2 [5] from Argonne National Laboratory. The BG/L version is MPI-1.2 compliant [15] and supports a subset of the MPI-2 standard. There are parts of MPI-2, such as dynamic process management, that are not supported.…”
Section: MPI on BG/L
confidence: 99%
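
Because BG/L's MPI is MPI-1.2 compliant with only partial MPI-2 support, portable code can query the reported standard level before relying on MPI-2 features such as dynamic process management. Here is a minimal sketch using the standard MPI_Get_version call; note this is a coarse check, since an implementation may report level 2 while still omitting individual MPI-2 features, as the quote notes for BG/L.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int version, subversion;
        MPI_Get_version(&version, &subversion);  /* standard level supported */
        printf("MPI standard level: %d.%d\n", version, subversion);

        if (version < 2)
            printf("Assuming no dynamic process management; "
                   "using a fixed process set.\n");

        MPI_Finalize();
        return 0;
    }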