The UPC programming language offers parallelism via a logically partitioned shared memory, which typically spans physically disjoint memory sub-systems. One convenient feature of UPC is its ability to automatically carry out between-thread data movement, such that the entire content of a shared data array appears to be freely accessible by all the threads. This programmer-friendliness, however, can come at the cost of substantial performance penalties. This is especially true when indirectly indexing the elements of a shared array, for which the induced between-thread communication can be irregular and fine-grained. In this paper we study performance enhancement strategies specifically targeting such fine-grained irregular communication in UPC. Starting from explicit thread privatization, continuing with block-wise communication, and arriving at message condensing and consolidation, we obtain considerable performance improvements for UPC programs that originally require fine-grained irregular communication. Besides the performance enhancement strategies, the main contribution of the present paper is to propose performance models for the different scenarios, in the form of quantifiable formulas that hinge on the actual volumes of the various data movements plus a small number of easily obtainable hardware characteristic parameters. These performance models help to verify the enhancements obtained, while also providing insightful predictions for similar parallel implementations, not limited to UPC, that involve between-thread or between-process irregular communication. As a further validation, we also apply our performance modeling methodology and hardware characteristic parameters to an existing UPC code for solving a 2D heat equation on a uniform mesh.
Motivation

Good programmer productivity and high computational performance are usually two conflicting goals in the context of developing parallel code for scientific computations. Partitioned global address space (PGAS) [10,2,11,22], however, is a parallel programming model that aims to achieve both goals at the same time. The fundamental mechanism of PGAS is a global address space that is conceptually shared among the concurrent processes that jointly execute a parallel program. Data exchange between the processes is carried out by a low-level network layer "under the hood", without explicit involvement from the programmer, thus providing good productivity. The shared global address space is logically partitioned such that each partition has affinity to a designated owner process. This awareness of data locality is essential for achieving good performance of parallel programs written in the PGAS model, because the globally shared address space may actually encompass many physically distributed memory sub-systems.

Unified Parallel C (UPC) [13,28] is an extension of the C language that provides the PGAS parallel programming model. The concurrent execution processes of UPC are termed threads, which execute a UPC program in the style of single-program-multiple-...
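As a minimal illustration of the implicit communication described above, the following UPC sketch (the names N, idx, and gather_sum are our own illustrative assumptions, not from the source) declares a block-distributed shared array and indexes it indirectly; reads of elements with affinity to another thread silently become fine-grained remote transfers:

```c
/* UPC sketch (requires a UPC compiler, e.g. Berkeley UPC's upcc).
 * Each thread owns one contiguous block of x[] and idx[]; any thread
 * may nevertheless read any element. When idx[i] points into another
 * thread's block, the runtime performs a one-element remote read --
 * exactly the fine-grained irregular communication discussed here. */
#include <upc.h>

#define N 1024                      /* illustrative array size */
shared [N/THREADS] double x[N];     /* block-wise affinity: thread t owns block t */
shared [N/THREADS] int    idx[N];   /* irregular (indirect) index array */

double gather_sum(void) {
    double s = 0.0;
    /* affinity expression &idx[i]: each iteration runs on the thread
       that owns idx[i], so idx[i] itself is a local read ... */
    upc_forall (int i = 0; i < N; i++; &idx[i]) {
        s += x[idx[i]];   /* ... but x[idx[i]] may be remote: implicit,
                             fine-grained between-thread communication */
    }
    return s;
}
```

The convenience is that gather_sum contains no explicit communication calls; the cost, as the paper argues, is that each remote x[idx[i]] access may translate into a separate small network message.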