Demonstrating the scalability of a molecular dynamics application on a Petaflop computer

Almási, George; Caşcaval, Călin; Castaños, José G.; Denneau, Monty; Donath, W. E.; Eleftheriou, Maria; Giampapa, Mark; Ho, Howard; Lieber, Derek; Moreira, José E.; Newns, Dennis M.; Snir, Marc; Warren, H. Shaw

doi:10.1145/377792.377896

Cited by 29 publications

(29 citation statements)

References 12 publications


“…Our goal is instead to investigate how to improve MD performance and scalability on a low-cost cluster platform, which is available to individual research groups. George et al [11] conducted similar research at the initial stage of the IBM BlueGene architecture but did not discuss hierarchical optimization. There are other projects like NAMD [12] mainly targeting supercomputing systems composed of conventional clusters.…”

Section: Linked-list Cell Molecular Dynamics Simulation (mentioning)

confidence: 99%

Exploiting hierarchical parallelisms for molecular dynamics simulation on multicore clusters

et al. 2011


We have developed a scalable hierarchical parallelization scheme for molecular dynamics (MD) simulation on multicore clusters. The scheme explores multilevel parallelism combining: (1) internode parallelism using spatial decomposition via message passing; (2) intercore parallelism using cellular decomposition via multithreading employing a master/worker model; (3) data-level optimization via single-instruction multiple-data (SIMD) parallelism with various code transformation techniques. By using a hierarchy of parallelisms, the scheme exposes very high concurrency and data locality, thereby achieving: (1) internode weak-scaling parallel efficiency 0.985 on 106,496 BlueGene/L nodes (0.975 on 32,768 BlueGene/P nodes), internode strong-scaling parallel efficiency 0.90 on 8,192 BlueGene/L nodes; (2) intercore multithread parallel efficiency 0.65 for eight threads on a dual quadcore Xeon platform; and (3) SIMD speedup around 2 for problem sizes ranging from 3,072 to 98,304 atoms. Furthermore, the effect of memory-access penalty on SIMD performance is analyzed, and an application-based SIMD analysis scheme is proposed to help programmers determine whether their applications are amenable to SIMDization.
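The weak-scaling and strong-scaling efficiencies quoted in this abstract can be read against the conventional definitions of the two measures. The following is a minimal sketch under that assumption; the abstract does not state which formulas the authors used, and the function names and timings below are hypothetical, not taken from the paper.

```python
def strong_scaling_efficiency(t_ref, p_ref, t, p):
    """Efficiency for a FIXED total problem size.

    Assumed conventional definition: the ideal run time shrinks in
    proportion to p_ref / p, so efficiency = (t_ref * p_ref) / (t * p).
    """
    if t_ref <= 0 or t <= 0 or p_ref <= 0 or p <= 0:
        raise ValueError("times and processor counts must be positive")
    return (t_ref * p_ref) / (t * p)


def weak_scaling_efficiency(t_ref, t):
    """Efficiency for a FIXED problem size per node.

    Assumed conventional definition: the ideal run time stays constant
    as nodes (and total work) grow, so efficiency = t_ref / t.
    """
    if t_ref <= 0 or t <= 0:
        raise ValueError("times must be positive")
    return t_ref / t


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, for illustration only.
    print(strong_scaling_efficiency(t_ref=100.0, p_ref=1, t=25.0, p=8))
    print(weak_scaling_efficiency(t_ref=10.0, t=12.5))
```

Under these definitions, a value of 1.0 means perfect scaling, and values below 1.0 measure the fraction of the ideal speedup that is retained.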


“…Our MTS method has been implemented in the Open64 compiler retargeted for the IBM Cyclops64 architecture, a dedicated petaflop platform for running high performance applications [1,28]. Such a machine is built out of tens of thousands of IBM Cyclops64 processing nodes arranged in a 3D-mesh network.…”

Section: Experimental Framework (mentioning)

confidence: 99%

Software-Pipelining on Multi-Core Architectures

Douillet, Gao

2007

16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007)


It is becoming increasingly evident that multi-core chip architectures are emerging as a solution to efficiently amortizing the ever-growing number of transistors on a chip. However, the success of such multi-core chips depends on advances in system software technology, such as compilers and run-time systems, so that application programs can exploit thread-level parallelism in originally single-threaded applications and fully utilize the hardware on-chip concurrency. In this paper, we propose a method which, from a parallel and non-parallel imperfect loop nest written in a standard sequential language such as C or Fortran, automatically generates a multi-threaded software-pipelined schedule for multi-core architectures. The generated schedule already contains all the necessary synchronization instructions and is guaranteed free of deadlocks and buffer overflow. The feasibility of the proposed method within a modern compiler infrastructure has been verified through a pilot implementation in the Open64 compiler and tested on the IBM Cyclops multi-core architecture. Experimental results show that the performance exhibits good scalability even with 100 cores. Our light-weight synchronization mechanism minimizes dependency stalls and synchronization overheads without the use of dedicated hardware support.


“…Many researchers have created analytical models of important kernels and applications [3] [5]. These models range from calculating the number of operations necessary to complete a common mathematical operation, such as a matrix multiply, to complete models of entire codes, such as protein folding.…”

Section: Top-down Algorithmic Model Creation (mentioning)

confidence: 99%

“…These models range from calculating the number of operations necessary to complete a common mathematical operation, such as a matrix multiply, to complete models of entire codes, such as protein folding. Almasi, et al [3] present such a model of the protein folding application for the original Blue Gene architecture. Their analysis eloquently decomposes the application into its main computational and communication kernels.…”

Section: Top-down Algorithmic Model Creation (mentioning)

confidence: 99%
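The simplest kind of analytical model mentioned in the statements above, an operation count for a matrix multiply, can be sketched as follows. This is an illustrative example only, not code from the cited papers: the function names are hypothetical, and the model assumes the naive triple-loop algorithm on dense matrices with an inner dimension of at least 1.

```python
def matmul_op_model(m, k, n):
    """Closed-form operation count for a naive (m x k) @ (k x n) product.

    Each of the m*n output entries needs k multiplies and k-1 additions.
    """
    multiplies = m * n * k
    additions = m * n * (k - 1)
    return multiplies, additions


def matmul_op_count(a, b):
    """Run the naive algorithm on lists of lists and count operations.

    Returns the product together with the observed multiply and
    addition counts, so the closed-form model can be validated.
    """
    m, k, n = len(a), len(b), len(b[0])
    mults = adds = 0
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = a[i][0] * b[0][j]      # first term: one multiply
            mults += 1
            for l in range(1, k):        # remaining terms: multiply + add
                acc += a[i][l] * b[l][j]
                mults += 1
                adds += 1
            c[i][j] = acc
    return c, mults, adds


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]]
    _, mults, adds = matmul_op_count(a, b)
    # The instrumented run should agree with the closed-form model.
    print((mults, adds) == matmul_op_model(2, 3, 4))
```

Comparing the instrumented count with the closed form mirrors, on a very small scale, the idea of validating an analytical model against empirical data.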

A framework to develop symbolic performance models of parallel applications

Alam, Vetter

2006

Proceedings 20th IEEE International Parallel & Distributed Processing Symposium


Performance and workload modeling has numerous uses at every stage of the high-end computing lifecycle: design, integration, procurement, installation and tuning. Despite the tremendous usefulness of performance models, their construction remains largely a manual, complex, and time-consuming exercise. We propose a new approach to the model construction, called modeling assertions (MA), which borrows advantages from both the empirical and analytical modeling techniques. This strategy has many advantages over traditional methods: incremental construction of realistic performance models, straightforward model validation against empirical data, and intuitive error bounding on individual model terms. We demonstrate this new technique on the NAS parallel CG and SP benchmarks by constructing high fidelity models for the floating-point operation cost, memory requirements, and MPI message volume. These models are driven by a small number of key input parameters thereby allowing efficient design space exploration of future problem sizes and architectures.

Introduction

Performance and workload modeling has numerous uses at every stage of the high-end computing lifecycle: design, integration, procurement, installation, tuning, and maintenance. Despite the tremendous usefulness of performance models, their construction remains largely a manual, complex, and time-consuming exercise. In most cases, researchers create models by manually interrogating applications with an array of performance, debugging, and static analysis tools to refine the model iteratively until the predictions fall within expectations. In other cases, researchers start with an algorithm description, and develop the performance model directly from this abstract description.

In this paper, we describe a new approach to performance model construction, called modeling assertions (MA), which borrows advantages from both the empirical and analytical modeling techniques. This strategy has many advantages over traditional methods: isomorphism with the application structure, easy incremental validation of the model with empirical data, uncomplicated sensitivity analysis, and straightforward error bounding on individual model terms. We demonstrate the use of MA by designing a prototype framework, which allows construction, validation, and analysis of models of parallel applications written in FORTRAN and C with the MPI communication library. We use the prototype to construct models of NAS CG and SP benchmarks [4].

MA generates two types of representations of the target application: control flow models and symbolic models that can be evaluated with MATLAB or Octave. Symbolic models are generated for the number of floating-point and memory operations, and for MPI point-to-point and collective communication operations. Control flow models provide a mechanism not only to understand the control flow of an application but also to generate alternate model representations in programming languages like C or Python. The models are represented in terms of an application's input pa...

