2015
DOI: 10.1109/tc.2014.2345391

Parallel Reproducible Summation

Abstract: Reproducibility, i.e., getting bitwise identical floating point results from multiple runs of the same program, is a property that many users depend on either for debugging or correctness checking in many codes [10]. However, the combination of dynamic scheduling of parallel computing resources and floating point nonassociativity makes attaining reproducibility a challenge even for simple reduction operations like computing the sum of a vector of numbers in parallel. We propose a technique for floating point …
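The nonassociativity the abstract refers to is easy to reproduce. The following is a minimal illustration (ours, not from the paper): the same three summands give different results under different groupings, which is exactly what happens when a dynamically scheduled parallel reduction changes the order of additions between runs.

#include <stdio.h>

int main(void) {
    double a = 1.0, b = 1e-16, c = 1e-16;
    double left  = (a + b) + c;   /* each 1e-16 is absorbed by 1.0 */
    double right = a + (b + c);   /* 2e-16 is large enough to survive */
    printf("left  = %.17g\n", left);   /* prints 1 */
    printf("right = %.17g\n", right);  /* prints 1.0000000000000002 */
    return 0;
}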

Cited by 54 publications (49 citation statements)
References 7 publications
“…As Section 4 shows, our algorithm is faster in bandwidth-constrained scenarios with moderate dynamic ranges. Demmel and Nguyen have also improved the previous results [28, 22] by using a single reduction step among nodes. This improvement incurred roughly 20% overhead on 1024 processors compared to the Intel MKL dasum(), but shows a roughly 3.4× slowdown on 32 processors.…”
Section: Related Work
confidence: 80%
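The single-reduction idea can be sketched in a few lines. What follows is a hedged toy, not the authors' ReproBLAS code: it uses the simplest one-bin form of pre-rounding, assumes an a priori bound MAXABS on the input magnitudes, and sacrifices accuracy that the real algorithm recovers with several bins; preround, MAXABS, and the 2^40 of slack are our illustrative choices.

#include <math.h>
#include <mpi.h>
#include <stdio.h>

/* (x + M) - M keeps only the bits of x significant at M's boundary,
 * so every pre-rounded term is a multiple of ulp(M). */
static double preround(double x, double M) {
    volatile double t = x + M;   /* volatile: force the double rounding */
    return t - M;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 1000 };           /* local chunk size: arbitrary */
    double x[N], MAXABS = 1.0;   /* assumed known bound on |x_i| */
    for (int i = 0; i < N; i++) x[i] = sin(0.001 * (rank * N + i));

    /* Boundary with 2^40 of slack: every partial sum of pre-rounded
     * terms is exact, so the grouping order cannot matter. */
    double M = ldexp(MAXABS, 40);

    double local = 0.0, global;
    for (int i = 0; i < N; i++) local += preround(x[i], M);

    /* The single reduction: bitwise identical result regardless of
     * schedule or process count, because every addition was exact. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("reproducible sum = %.17g\n", global);
    MPI_Finalize();
    return 0;
}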
“…The one-reduction reproducible summation, Alg. 6 (Sequential Reproducible Summation) of [22] (referred to as ReproBLAS), from the ReproBLAS library; 5. The single-sweep reduction [23] with two and three levels (cited as bitrep2 and bitrep3, respectively) from the bitrep library.…”
Section: Baseline Algorithms and Experimental Setup
confidence: 99%
“…We have implemented an OpenMP parallel version of this algorithm, since ReproBLAS offers only an MPI parallel version. We derive reproducible versions of dot, nrm2, asum, and gemv by replacing all non-associative accumulations with the OneReduction algorithm [6]. These versions are denoted OneReductionDot, OneReductionAsum, OneReductionNrm2, and OneReductionGemv.…”
Section: Implementation and Performance Results
confidence: 99%
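The substitution described here can be illustrated for dot with a small sketch (our names, not the paper's OneReductionDot): each product round(x[i]*y[i]) is computed deterministically by the hardware, so feeding the products through a reproducible accumulation, here the one-bin pre-rounding from the sketch above with a boundary M that bounds the products, makes the whole dot reproducible.

#include <math.h>

static double preround(double x, double M) {
    volatile double t = x + M;      /* one rounding at M's boundary */
    return t - M;
}

/* Reproducible dot: only the accumulation is replaced; the products
 * themselves are already deterministic. */
double repro_dot(const double *x, const double *y, int n, double M) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += preround(x[i] * y[i], M);  /* exact, order-independent adds */
    return s;
}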
“…Numerical results therefore no longer depend on the hardware configuration. The performance of the latter is improved by the OneReduction algorithm [6], which relies on indexed floating-point numbers [5] and requires only a single reduction operation, reducing the communication cost on distributed-memory parallel platforms. However, those solutions do not improve accuracy.…”
Section: Introduction
confidence: 99%
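A rough picture of the indexed floating-point numbers mentioned here [5]: the accumulator is a small array of ordinary doubles pinned to fixed exponent boundaries, so deposits are exact and a whole accumulator can travel through one reduction. The sketch below is our approximation, not the library's layout; K, the field names, and the omitted bin initialization and renormalization are assumptions.

#define K 3                      /* number of bins ("fold"): assumption */

typedef struct {
    double bin[K];               /* each pre-set to a boundary constant,
                                    e.g. 1.5 * 2^e_j (setup omitted) */
} indexed_double;

/* Deposit x: each bin absorbs the part of x representable at its
 * boundary (a FastTwoSum step); the exact remainder cascades down. */
static void deposit(indexed_double *a, double x) {
    for (int j = 0; j < K - 1; j++) {
        double q = a->bin[j];
        a->bin[j] = q + x;       /* high part of x lands in bin j */
        x -= a->bin[j] - q;      /* exact remainder for finer bins */
    }
    a->bin[K - 1] += x;
}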
“…Strong numerical reproducibility can be further subdivided into two classes of algorithms: those that produce correctly rounded results, such as the ones based on long accumulators [7], and others that provide reproducible results without any guarantee of accuracy [8].…”
Section: Introduction
confidence: 99%
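For contrast with the [8]-style reproducible-but-not-correctly-rounded sums, the long-accumulator route [7] can be caricatured in fixed point. This toy is ours and far narrower than a real Kulisch-style accumulator: it assumes every input is 0 or has magnitude in [2^-12, 2^50), so that x * 2^64 is an exact integer fitting a 128-bit accumulator (__int128 is a GCC/Clang extension), and that n is moderate enough to avoid overflow; the real design covers the full double exponent range.

#include <math.h>

typedef __int128 acc_t;

/* Exact fixed-point accumulation, then a single final rounding: the
 * result is both reproducible and correctly rounded (under the stated
 * input restrictions). */
double long_acc_sum(const double *x, int n) {
    acc_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (acc_t)ldexp(x[i], 64);  /* exact integer adds, any order */
    return ldexp((double)acc, -64);     /* one rounding at the end */
}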