Reproducibility strategies for parallel Preconditioned Conjugate Gradient

Iakymchuk, Roman; Barreda, Maria; Wiesenberger, Matthias; Aliaga, José I.; Quintana-Ortí, Enrique S.

doi:10.1016/j.cam.2019.112697

Cited by 9 publications

(19 citation statements)

References 27 publications

“…The ExBLAS and Opt implementations deliver both accurate and reproducible results that are identical with the MPFR library. Note that these results are identical to the ones from the pure MPI implementations in Iakymchuk et al (2019a) and only the results of the original code differ. The original code shows the difference from one digit on the initial iteration and up to 5 digits on the 45th iteration on 48 cores (8 MPI processes with 6 OpenMP threads per each).…”

Section: Results (supporting)

confidence: 68%
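The digit-level drift reported for the original code in the statement above is the kind of effect that non-associative floating-point addition produces when the reduction order changes with the process count. The following is a minimal sketch of that effect, not the paper's code: `naive_sum` and `reduced_sum` are hypothetical helpers, and the contiguous chunks only stand in for MPI ranks.

```python
import math


def naive_sum(values):
    """Left-to-right double-precision sum (plain loop, no compensation)."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def reduced_sum(values, nchunks):
    """Mimic a parallel reduction: sum contiguous chunks, then sum the partials.

    Assumption: each chunk plays the role of one process; real MPI
    reductions may combine partial results in yet another order.
    """
    size = -(-len(values) // nchunks)  # ceiling division
    partials = [naive_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return naive_sum(partials)


values = [1e16, 1.0, -1e16, 1.0]
one_process = reduced_sum(values, 1)    # 1.0
two_processes = reduced_sum(values, 2)  # 0.0
reference = math.fsum(values)           # 2.0, the correctly rounded sum
```

The same four numbers yield 1.0 on one simulated process and 0.0 on two, while the correctly rounded sum is 2.0. An iterative solver consumes such sums in every step, so differences of this kind can accumulate over the iterations.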

“…Motivated by “100 bits suffice for many HPC applications” as noted by David Bailey at ARITH-21 Bailey (2013) and a mini accumulator from the ARM team Lutz and Hinds (2017); Burgess et al (2019), we derive a faster but less generic version using FPEs, which is the other core algorithmic component in the ExBLAS approach, aiming to adjust the algorithm to the problem at hand. As a consequence, we also address the common issue of sparse iterative solvers (the accuracy while computing the residual) and propose to use solutions that offer reproducibility (and potentially correct rounding) only while computing the corresponding dot products. Hence, we derive two hybrid (MPI + OpenMP tasks), reproducible, and accurate dot products using ExBLAS and FPEs. Finally, we demonstrate applicability and feasibility of the aforementioned idea with the ExBLAS- and FPE-based approaches in the hybrid MPI + OpenMP implementation of PCG on an example of a 3D Poisson’s equation with 27 stencil points as well as several test matrices from the SuiteSparse matrix collection. This extends our previous results with the pure MPI implementation of PCG Iakymchuk et al (2019a) to the more complex double-level dot products and reductions with dynamic scheduling of the tasks.…”

Section: Introduction (supporting)

confidence: 82%
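The idea of a dot product built on floating-point expansions (FPEs), as described in the statement above, can be illustrated with error-free transformations. This is a sketch under stated assumptions and not the ExBLAS implementation: the expansion here grows without bound, whereas the cited work uses fixed-size FPEs, and the final rounding goes through `fractions.Fraction` purely for clarity. All function names are hypothetical, and the code assumes IEEE 754 binary64 with round-to-nearest and no overflow or underflow.

```python
from fractions import Fraction


def two_sum(a, b):
    """Knuth's TwoSum: s = fl(a + b) and a + b == s + err exactly."""
    s = a + b
    bv = s - a
    err = (a - (s - bv)) + (b - bv)
    return s, err


def split(a):
    """Veltkamp splitting of a double into two non-overlapping halves."""
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi


def two_prod(a, b):
    """Dekker's TwoProduct: p = fl(a * b) and a * b == p + err exactly."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err = al * bl - (((p - ah * bh) - al * bh) - ah * bl)
    return p, err


def add_to_expansion(expansion, x):
    """Add x to the expansion; the exact sum of the terms is preserved."""
    out = []
    for y in expansion:
        x, err = two_sum(x, y)
        if err != 0.0:
            out.append(err)
    out.append(x)
    return out


def fpe_dot(xs, ys):
    """Dot product that keeps every bit until a single final rounding."""
    expansion = []
    for a, b in zip(xs, ys):
        p, e = two_prod(a, b)
        expansion = add_to_expansion(expansion, p)
        if e != 0.0:
            expansion = add_to_expansion(expansion, e)
    # Exact rational sum of the terms, rounded once to the nearest double.
    return float(sum(Fraction(t) for t in expansion))
```

Because the expansion represents the dot product exactly and is rounded only once, the returned value does not depend on the order in which the element pairs are visited. That order independence is the property a reproducible reduction needs.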

“…from 18% to 6% in MN4, on the large core count: this is due to very similar implementations of both since ExBLAS underneath relies upon FPE8EE for the OpenMP dot products. Note that such difference is much larger for the pure MPI implementation Iakymchuk et al (2019a).…”

Section: Results (mentioning)

confidence: 94%

“…In addition, we conduct experiments using the pure MPI versions of the Reproducible Preconditioned Conjugate Gradient Iakymchuk et al (2019a) on the MareNostrum4 and Tintorrum clusters, see Tables 6 and 7. We observe that the number of iterations, residuals, direct errors, the final error, and vector-solutions are identical to those produced by the MPI+OpenMP tasks versions.…”

Section: Results (mentioning)

confidence: 99%

Reproducibility of parallel preconditioned conjugate gradient in hybrid programming environments

Iakymchuk, Barreda, Graillat, et al. 2020

The International Journal of High Performance Computing Applica

Self Cite

The Preconditioned Conjugate Gradient method is often employed for the solution of linear systems of equations arising in numerical simulations of physical phenomena. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we propose two algorithmic solutions that originate from the ExBLAS project to enhance the accuracy of the solver as well as to ensure its reproducibility in a hybrid MPI + OpenMP tasks programming environment. One is based on ExBLAS and preserves every bit of information until the final rounding, while the other relies upon floating-point expansions and, hence, expands the intermediate precision. Instead of converting the entire solver into its ExBLAS-related implementation, we identify those parts that violate reproducibility because of non-associativity, secure them, and combine this with the sequential executions. These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these approaches on two modern HPC systems: both versions deliver a reproducible number of iterations, residuals, direct errors, and vector-solutions at an overhead of less than 37.7% on 768 cores.
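The strategy in the abstract, securing only the reductions rather than the whole solver, can be sketched with a PCG routine that takes its dot product as a parameter. This is an illustrative sketch, not the authors' implementation: `pcg`, `poisson_1d`, and `fsum_dot` are hypothetical names, the test problem is a small 1D Poisson system rather than the 3D one in the paper, and `math.fsum` merely stands in for an ExBLAS or FPE kernel (it makes the summation order-independent but does not keep the products exact).

```python
import math


def fsum_dot(u, v):
    """Stand-in reproducible dot: rounded products, correctly rounded sum."""
    return math.fsum(a * c for a, c in zip(u, v))


def poisson_1d(v):
    """Matrix-vector product with tridiag(-1, 2, -1)."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]


def pcg(matvec, diag, b, dot, tol=1e-12, max_iter=200):
    """Jacobi-preconditioned CG; every reduction goes through `dot`."""
    x = [0.0] * len(b)
    r = list(b)
    z = [ri / di for ri, di in zip(r, diag)]
    p = list(z)
    rz = dot(r, z)
    for it in range(max_iter):
        if math.sqrt(dot(r, r)) <= tol:
            return x, it
        q = matvec(p)
        alpha = rz / dot(p, q)
        # Element-wise updates have no reduction, so they stay as-is.
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


n = 8
x, iters = pcg(poisson_1d, [2.0] * n, [1.0] * n, fsum_dot)
```

Only the `dot` calls depend on summation order. Swapping in a dot product whose result is independent of that order therefore makes the iterates and the iteration count independent of it too, while the vector updates run unchanged.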