Performance Analysis and Optimization of Parallel Scientific Applications on CMP Cluster Systems (2008)
DOI: 10.1109/icpp-w.2008.21

Cited by 16 publications (11 citation statements); references 8 publications.
“…The goal of processor binding is to reduce the conflicts of chip resources on the CMP system. In our previous work [14], we found that processor binding resulted in up to 7.16% performance improvements for MPI scientific applications. Here, we use the command pbind to implement a batch process to bind the threads to different physical processors in order to reduce the resource contentions and system overhead from the dynamic scheduler on Pangu.…”
Section: Figures 4 and 5 Show the Function-Level Performance of Our O…
confidence: 98%
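
As a rough illustration of the batch-binding step described in the excerpt above, the sketch below binds a set of process IDs to distinct physical processors with the Solaris pbind command. The PID list, the processor IDs, and the round-robin assignment are illustrative assumptions, not the authors' actual script.

    # Hypothetical sketch: batch-bind process IDs to distinct physical
    # processors via the Solaris pbind command (pbind -b <cpu> <pid>).
    # The PIDs and processor IDs below are placeholders.
    import subprocess

    def bind_processes(pids, processor_ids):
        """Bind each PID to one processor ID, round-robin over the CPUs."""
        for i, pid in enumerate(pids):
            cpu = processor_ids[i % len(processor_ids)]
            subprocess.run(["pbind", "-b", str(cpu), str(pid)], check=True)

    if __name__ == "__main__":
        # e.g. four MPI ranks (placeholder PIDs) bound to processors 0-3
        bind_processes([12001, 12002, 12003, 12004], [0, 1, 2, 3])
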
“…It is expected that the best number of threads per node is dependent upon the application characteristics and the system architectures. In this paper, we investigate how a hybrid application is sensitive to different memory access patterns, and quantify the performance gap resulting from using different number of threads per node for application execution on a large scale multithreaded BlueGene/Q supercomputer [1] at Argonne National Laboratory using five different hybrid MPI/OpenMP scientific applications (two NAS Parallel benchmarks Multi-Zone SP-MZ and BT-MZ [4], an earthquake simulation PEQdyna [20], an aerospace application PMLB [19] and a 3D particle-in-cell application GTC [2]).…”
Section: Introduction
confidence: 99%
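
A minimal sketch of the per-node thread-count sweep the excerpt describes, assuming the OpenMP thread count is controlled through OMP_NUM_THREADS and the job is launched with a standard mpiexec; the executable name, node count, cores per node, and launcher flags are placeholders that vary by system and are not taken from the cited study.

    # Hypothetical sketch: run a hybrid MPI/OpenMP executable with several
    # threads-per-rank settings while keeping every core of each node busy.
    import os
    import subprocess

    def run_sweep(exe="./bt-mz.x", nodes=128, cores_per_node=16):
        for threads in (1, 2, 4, 8, 16):
            ranks = nodes * (cores_per_node // threads)   # fill each node
            env = dict(os.environ, OMP_NUM_THREADS=str(threads))
            print(f"launching {ranks} ranks with {threads} threads each")
            subprocess.run(["mpiexec", "-n", str(ranks), exe], env=env, check=True)

    if __name__ == "__main__":
        run_sweep()
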
“…In contrast to the conventional methods in fluid dynamics, which are based on the discretization of macroscopic differential equations, the LBM has the ability to deal efficiently with complex geometries and topologies [25]. For our experiments, we use the parallel multiblock implementation (extended to 3D problems) of the LBM developed by Yu et al [26].…”
Section: Parallel Multiblock Lattice Boltzmann (PMLB)
confidence: 99%