Reduction operations play a key role in modern massively data parallel computation. However, current implementations in shared memory programming APIs such as OpenMP are often a cause of computation bottlenecks due to the high number of atomic operations involved. We propose a reduction design that takes advantage of the coupling with a barrier synchronization to optimize the execution of the reduction. Experimental results show that the number of atomic operations involved is dramatically reduced, which can lead to significant improvement in scaling properties on large numbers of processing elements. We report a speedup of 59.64% on the 312.swim_m SPEC OMP2001 benchmark and a speedup of 24.89% on the streamcluster benchmark from the PARSEC suite over the GCC libgomp baseline.
Introduction

The rise of multi-core architectures in recent years has led to a widespread need for parallel software. Given the limited improvements in clock rates, exploiting parallel execution is necessary to guarantee performance improvements.

Parallelism can be exploited at several levels of granularity, from instruction-level parallelism to data parallelism to task parallelism. The OpenMP [1] Application Programming Interface (API) aims at providing an easy-to-use way to program parallel applications at multiple levels of granularity, implemented on top of the C and Fortran languages. Specifically, it targets data and task parallelism by providing directives to identify parallel regions of code and parallel loop constructs.

OpenMP also offers a reduction clause to provide some support for recursive array computation, inspired by the reduce or fold constructs of functional languages [6]. In functional languages such as Lisp or Haskell, reduce is a higher-order operator that takes as input a binary function f, a list l, and an initial value v, and is defined recursively as follows:

reduce(f, [], v) = v
reduce(f, x:xs, v) = f(x, reduce(f, xs, v))

If the binary function f is associative, it is possible to parallelize the reduce operation, executing it in approximately log2(|l|) steps, where each step i computes a set of intermediate results t_i by applying f to pairs of values of t_{i-1}.

OpenMP support for reduce-like constructs is limited to associative and commutative binary operators and, in the case of Fortran, intrinsic procedures, which are also associative and commutative functions. Arbitrary functions f are not supported.

Reduce-inspired constructs are essential for the expression of data parallelism, as they provide the means to express the extraction of synthetic results from large amounts of data.
Recent works in the field of distributed computing [9] show that many data parallel computations can be easily expressed in terms of a reduce-like construct paired with a map-like construct. A map construct essentially allows the execution of a given n-ary function on all the n-tuples obtained by taking an element from each of n sequences of equal length.

In OpenMP, the parallel loop construct provides the basic data parallelism, replaci...