Abstract: The Preconditioned Conjugate Gradient method is often used in numerical simulations. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we aim at a twofold goal: enhance the accuracy of the solver but also ensure its reproducibility in a message-passing implementation. We design and employ various strategies starting from the ExBLAS approach (through preserving every bit of information until final rounding) to its more lightweight performan…
“…The ExBLAS and Opt implementations deliver both accurate and reproducible results that are identical with the MPFR library. Note that these results are identical to the ones from the pure MPI implementations in Iakymchuk et al (2019a) and only the results of the original code differ. The original code shows the difference from one digit on the initial iteration and up to 5 digits on the 45th iteration on 48 cores (8 MPI processes with 6 OpenMP threads per each).…”
Section: Results (supporting)
confidence: 68%
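The digit-level differences described in the statement above come from rounding after every floating-point addition, so the result depends on summation order. The following is a minimal sketch of that effect, not the paper's code: it uses Python's standard `math.fsum` as a stand-in for an exact, order-independent summation (the paper itself compares against MPFR), and the helper name `naive_sum` and the test data are assumptions made for illustration.

```python
import math
import random

def naive_sum(values):
    """Left-to-right summation in double precision; every addition rounds."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same three numbers in two different summation orders.
order_a = [1e16, 1.0, -1e16]
order_b = [1e16, -1e16, 1.0]

naive_a = naive_sum(order_a)  # 1.0 is absorbed by 1e16 and lost
naive_b = naive_sum(order_b)  # the large terms cancel first, so 1.0 survives

# math.fsum tracks the exact sum and rounds once, so order does not matter.
exact_a = math.fsum(order_a)
exact_b = math.fsum(order_b)

# A larger experiment: shuffle the terms, as a dynamic schedule might.
rng = random.Random(0)
data = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-8, 8) for _ in range(10_000)]
shuffled = data[:]
rng.shuffle(shuffled)
fsum_matches = math.fsum(data) == math.fsum(shuffled)
```

Here the naive sums disagree (0.0 versus 1.0) while the exactly accumulated sums agree, which is the property a reproducible dot product needs.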
“…Motivated by “100 bits suffice for many HPC applications” as noted by David Bailey at ARITH-21 Bailey (2013) and a mini accumulator from the ARM team Lutz and Hinds (2017); Burgess et al (2019), we derive a faster but less generic version using FPEs, which is the other core algorithmic component in the ExBLAS approach, aiming to adjust the algorithm to the problem at hand. As a consequence, we also address the common issue of sparse iterative solvers (the accuracy while computing the residual) and propose to use solutions that offer reproducibility (and potentially correct rounding) only while computing the corresponding dot products. Hence, we derive two hybrid (MPI + OpenMP tasks), reproducible, and accurate dot products using ExBLAS and FPEs. Finally, we demonstrate applicability and feasibility of the aforementioned idea with the ExBLAS- and FPE-based approaches in the hybrid MPI + OpenMP implementation of PCG on an example of a 3D Poisson’s equation with 27 stencil points as well as several test matrices from the SuiteSparse matrix collection. This extends our previous results with the pure MPI implementation of PCG Iakymchuk et al (2019a) to the more complex double-level dot products and reductions with dynamic scheduling of the tasks.…”
Section: Introduction (supporting)
confidence: 82%
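The double-level dot product mentioned above (a reduction over tasks inside each process, then a reduction across processes) can be illustrated without MPI or OpenMP. The sketch below is an assumption-laden simulation: Python's `fractions.Fraction` stands in for a long accumulator that keeps every bit, `random.shuffle` mimics the arbitrary completion order of dynamically scheduled tasks, and the function names `exact_partial_dot` and `two_level_dot` are hypothetical. It is not how ExBLAS is implemented.

```python
from fractions import Fraction
import random

def exact_partial_dot(x, y):
    """Exact dot product of two slices; Fraction(float) loses no bits."""
    acc = Fraction(0)
    for a, b in zip(x, y):
        acc += Fraction(a) * Fraction(b)
    return acc

def two_level_dot(x, y, n_procs, n_tasks, rng):
    """Reduce over tasks, then over processes, each in a random order."""
    n = len(x)
    proc_partials = []
    for p in range(n_procs):
        lo, hi = n * p // n_procs, n * (p + 1) // n_procs
        span = hi - lo
        task_partials = []
        for t in range(n_tasks):
            a = lo + span * t // n_tasks
            b = lo + span * (t + 1) // n_tasks
            task_partials.append(exact_partial_dot(x[a:b], y[a:b]))
        rng.shuffle(task_partials)   # mimic dynamic task completion order
        proc_partials.append(sum(task_partials, Fraction(0)))
    rng.shuffle(proc_partials)       # mimic arbitrary inter-process order
    return float(sum(proc_partials, Fraction(0)))  # single final rounding

rng = random.Random(42)
x = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-6, 6) for _ in range(960)]
y = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-6, 6) for _ in range(960)]

hybrid = two_level_dot(x, y, n_procs=8, n_tasks=6, rng=rng)
serial = two_level_dot(x, y, n_procs=1, n_tasks=1, rng=rng)
other = two_level_dot(x, y, n_procs=4, n_tasks=3, rng=rng)
```

Because the partial results are exact, merging them is associative, so the three configurations return bitwise-identical doubles regardless of how the work is split or in which order the partials arrive.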
“…from 18% to 6% in MN4, on the large core count: this is due to very similar implementations of both since ExBLAS underneath relies upon FPE8EE for the OpenMP dot products. Note that such difference is much larger for the pure MPI implementation Iakymchuk et al (2019a).…”
Section: Results (mentioning)
confidence: 94%
“…In addition, we conduct experiments using the pure MPI versions of the Reproducible Preconditioned Conjugate Gradient Iakymchuk et al (2019a) on the MareNostrum4 and Tintorrum clusters, see Tables 6 and 7. We observe that the number of iterations, residuals, direct errors, the final error, and vector-solutions are identical to those produced by the MPI+OpenMP tasks versions.…”
The Preconditioned Conjugate Gradient method is often employed for the solution of linear systems of equations arising in numerical simulations of physical phenomena. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we propose two algorithmic solutions that originate from the ExBLAS project to enhance the accuracy of the solver as well as to ensure its reproducibility in a hybrid MPI + OpenMP tasks programming environment. One is based on ExBLAS and preserves every bit of information until the final rounding, while the other relies upon floating-point expansions and, hence, expands the intermediate precision. Instead of converting the entire solver into its ExBLAS-related implementation, we identify those parts that violate reproducibility due to non-associativity, secure them, and combine this with the sequential executions. These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these approaches on two modern HPC systems: both versions deliver a reproducible number of iterations, residuals, direct errors, and vector-solutions at an overhead of less than 37.7% on 768 cores.
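The floating-point expansion idea in the abstract (carrying the rounding error of each operation in extra doubles, so that the intermediate precision grows) can be sketched with the classical error-free transformations TwoSum and TwoProduct. The code below is an illustration under stated assumptions rather than the authors' implementation: the default expansion size of 8 and the early exit only echo the FPE8EE name quoted in the citation statements, the `spill` variable is a hypothetical stand-in for falling back to a long accumulator, and the final rounding goes through `fractions.Fraction` purely for simplicity.

```python
from fractions import Fraction

def two_sum(a, b):
    """Knuth's TwoSum: s = fl(a + b) and e such that a + b == s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def split(a):
    """Veltkamp split of a double into two non-overlapping halves."""
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    """Dekker's TwoProduct: p = fl(a * b) and e such that a * b == p + e."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

def fpe_dot(x, y, size=8):
    """Dot product accumulated in an expansion of `size` doubles."""
    fpe = [0.0] * size
    spill = Fraction(0)  # assumption: stands in for a long accumulator

    def accumulate(v):
        nonlocal spill
        for i in range(size):
            fpe[i], v = two_sum(fpe[i], v)
            if v == 0.0:      # early exit: nothing left to propagate
                return
        spill += Fraction(v)  # expansion too short; keep the remainder exactly

    for a, b in zip(x, y):
        p, e = two_prod(a, b)
        accumulate(p)
        accumulate(e)

    # Simplified final step: combine all components exactly and round once.
    return float(spill + sum(Fraction(t) for t in fpe))

result = fpe_dot([1e16, 1.0, -1e16], [1.0, 1.0, 1.0])
```

In this example the term 1.0, which a plain double accumulator would lose against 1e16, is retained in the second slot of the expansion, so `result` is 1.0. The sketch ignores overflow and underflow, where the error-free transformations no longer hold.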