Reduction operations play a key role in modern massively data parallel computation. However, current implementations in shared memory programming APIs such as OpenMP are often a cause of computation bottlenecks due to the high number of atomic operations involved. We propose a reduction design that takes advantage of the coupling with a barrier synchronization to optimize the execution of the reduction. Experimental results show that the number of atomic operations involved is dramatically reduced, which can lead to significant improvement in scaling properties on large numbers of processing elements. We report a speedup of 59.64% on the 312.swim_m SPEC OMP2001 benchmark and a speedup of 24.89% on the streamcluster benchmark from the PARSEC suite over the GCC libgomp baseline.
Introduction

The rise of multi-core architectures in recent years has led to a widespread need for parallel software. Given the limited improvements in clock rates, exploiting parallel execution is necessary to guarantee performance improvements.

Parallelism can be exploited at several levels of granularity, from instruction-level parallelism to data parallelism to task parallelism. The OpenMP [1] Application Programming Interface (API) aims at providing an easy-to-use way to program parallel applications at multiple levels of granularity, implemented on top of the C and Fortran languages. Specifically, it targets data and task parallelism by providing directives to identify parallel regions of code and parallel loop constructs.

OpenMP also offers a reduction clause to provide some support for recursive array computation, inspired by the reduce or fold constructs of functional languages [6]. In functional languages such as Lisp or Haskell, reduce is a higher-order operator that takes as input a binary function f, a list l, and an initial value v, and is defined recursively as follows:

reduce(f, [], v) = v
reduce(f, x:xs, v) = f(x, reduce(f, xs, v))

If the binary function f is associative, it is possible to parallelize the reduce operation, executing it in approximately log2(|l|) steps, where each step i computes a set of intermediate results t_i by applying f to pairs of values of t_{i-1}.

OpenMP support for reduce-like constructs is limited to associative and commutative binary operators and, in the case of Fortran, intrinsic procedures, which are also associative and commutative functions. Arbitrary functions f are not supported.

Reduce-inspired constructs are essential for the expression of data parallelism, as they provide the means to express the extraction of synthetic results from large amounts of data.
Recent works in the field of distributed computing [9] show that many data parallel computations can be easily expressed in terms of a reduce-like construct paired with a map-like construct. A map construct essentially allows the execution of a given n-ary function on all the n-tuples obtained by taking an element from each of n sequences of equal length.

In OpenMP, the parallel loop construct provides the basic data parallelism, replaci...