2015
DOI: 10.1109/tc.2014.2345391

Parallel Reproducible Summation

Abstract: Reproducibility, i.e., getting bitwise identical floating point results from multiple runs of the same program, is a property that many users depend on either for debugging or correctness checking in many codes [10]. However, the combination of dynamic scheduling of parallel computing resources and floating point nonassociativity makes attaining reproducibility a challenge even for simple reduction operations like computing the sum of a vector of numbers in parallel. We propose a technique for floating point …
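The nonassociativity the abstract refers to is easy to reproduce. The following is a minimal illustration (ours, not from the paper): the same three summands give different results under different groupings, which is exactly what happens when a dynamically scheduled parallel reduction changes the order of additions between runs.

#include <stdio.h>

int main(void) {
    double a = 1.0, b = 1e-16, c = 1e-16;
    double left  = (a + b) + c;   /* each 1e-16 is absorbed by 1.0 */
    double right = a + (b + c);   /* 2e-16 is large enough to survive */
    printf("left  = %.17g\n", left);   /* prints 1 */
    printf("right = %.17g\n", right);  /* prints 1.0000000000000002 */
    return 0;
}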

Cited by 54 publications (49 citation statements)
References 7 publications
“…As Section 4 shows, our algorithm is faster in bandwidth-constrained scenarios with moderate dynamic ranges. Demmel and Nguyen have also improved the previous results [28, 22] by using a single reduction step among nodes. This improvement incurred roughly 20% overhead on 1024 processors compared to the Intel MKL dasum(), but shows a roughly 3.4× slowdown on 32 processors.…”
Section: Related Work
confidence: 80%
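The single-reduction idea can be sketched in a few lines. What follows is a hedged toy, not the authors' ReproBLAS code: it uses the simplest one-bin form of pre-rounding, assumes an a priori bound MAXABS on the input magnitudes, and sacrifices accuracy that the real algorithm recovers with several bins; preround, MAXABS, and the 2^40 of slack are our illustrative choices.

#include <math.h>
#include <mpi.h>
#include <stdio.h>

/* (x + M) - M keeps only the bits of x significant at M's boundary,
 * so every pre-rounded term is a multiple of ulp(M). */
static double preround(double x, double M) {
    volatile double t = x + M;   /* volatile: force the double rounding */
    return t - M;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 1000 };           /* local chunk size: arbitrary */
    double x[N], MAXABS = 1.0;   /* assumed known bound on |x_i| */
    for (int i = 0; i < N; i++) x[i] = sin(0.001 * (rank * N + i));

    /* Boundary with 2^40 of slack: every partial sum of pre-rounded
     * terms is exact, so the grouping order cannot matter. */
    double M = ldexp(MAXABS, 40);

    double local = 0.0, global;
    for (int i = 0; i < N; i++) local += preround(x[i], M);

    /* The single reduction: bitwise identical result regardless of
     * schedule or process count, because every addition was exact. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("reproducible sum = %.17g\n", global);
    MPI_Finalize();
    return 0;
}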
“…The one-reduction reproducible summation, Alg. 6 (Sequential Reproducible Summation) of [22] (referred to as ReproBLAS), from the ReproBLAS library; 5. The single-sweep reduction [23] with two and three levels (cited as bitrep2 and bitrep3, respectively) from the bitrep library.…”
Section: Baseline Algorithms and Experimental Setup
confidence: 99%
“…We have implemented an OpenMP parallel version of this algorithm, since ReproBLAS offers only an MPI parallel version. We derive reproducible versions of dot, nrm2, asum, and gemv by replacing all non-associative accumulations with the OneReduction algorithm [6]. These versions are denoted OneReductionDot, OneReductionAsum, OneReductionNrm2, and OneReductionGemv.…”
Section: Implementation and Performance Results
confidence: 99%
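The substitution described here can be illustrated for dot with a small sketch (our names, not the paper's OneReductionDot): each product round(x[i]*y[i]) is computed deterministically by the hardware, so feeding the products through a reproducible accumulation, here the one-bin pre-rounding from the sketch above with a boundary M that bounds the products, makes the whole dot reproducible.

#include <math.h>

static double preround(double x, double M) {
    volatile double t = x + M;      /* one rounding at M's boundary */
    return t - M;
}

/* Reproducible dot: only the accumulation is replaced; the products
 * themselves are already deterministic. */
double repro_dot(const double *x, const double *y, int n, double M) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += preround(x[i] * y[i], M);  /* exact, order-independent adds */
    return s;
}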
“…Numerical results therefore no longer depend on the hardware configuration. The performance of the latter is improved by the OneReduction algorithm [6], which relies on indexed floating-point numbers [5] and requires only a single reduction operation, reducing the communication cost on distributed-memory parallel platforms. However, those solutions do not improve accuracy.…”
Section: Introduction
confidence: 99%
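A rough picture of the indexed floating-point numbers mentioned here [5]: the accumulator is a small array of ordinary doubles pinned to fixed exponent boundaries, so deposits are exact and a whole accumulator can travel through one reduction. The sketch below is our approximation, not the library's layout; K, the field names, and the omitted bin initialization and renormalization are assumptions.

#define K 3                      /* number of bins ("fold"): assumption */

typedef struct {
    double bin[K];               /* each pre-set to a boundary constant,
                                    e.g. 1.5 * 2^e_j (setup omitted) */
} indexed_double;

/* Deposit x: each bin absorbs the part of x representable at its
 * boundary (a FastTwoSum step); the exact remainder cascades down. */
static void deposit(indexed_double *a, double x) {
    for (int j = 0; j < K - 1; j++) {
        double q = a->bin[j];
        a->bin[j] = q + x;       /* high part of x lands in bin j */
        x -= a->bin[j] - q;      /* exact remainder for finer bins */
    }
    a->bin[K - 1] += x;
}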
“…Strong numerical reproducibility can be further subdivided into two classes of algorithms: those that produce correctly rounded results, such as the ones based on long accumulators [7], and others that provide reproducible results without any guarantee of accuracy [8].…”
Section: Introduction
confidence: 99%
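For contrast with the [8]-style reproducible-but-not-correctly-rounded sums, the long-accumulator route [7] can be caricatured in fixed point. This toy is ours and far narrower than a real Kulisch-style accumulator: it assumes every input is 0 or has magnitude in [2^-12, 2^50), so that x * 2^64 is an exact integer fitting a 128-bit accumulator (__int128 is a GCC/Clang extension), and that n is moderate enough to avoid overflow; the real design covers the full double exponent range.

#include <math.h>

typedef __int128 acc_t;

/* Exact fixed-point accumulation, then a single final rounding: the
 * result is both reproducible and correctly rounded (under the stated
 * input restrictions). */
double long_acc_sum(const double *x, int n) {
    acc_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (acc_t)ldexp(x[i], 64);  /* exact integer adds, any order */
    return ldexp((double)acc, -64);     /* one rounding at the end */
}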