2015
DOI: 10.1007/978-3-319-17353-5_2
Mixed-Precision Orthogonalization Scheme and Adaptive Step Size for Improving the Stability and Performance of CA-GMRES on GPUs

Cited by 14 publications (15 citation statements)
References 5 publications
“…In addition, we complement our initial report with the following studies: (1) performance studies of the mixed-precision CholQR in the working 32-bit single precision with a GPU, where the higher precision is the 64-bit double precision and is supported by the hardware; (2) case studies with CA-GMRES on multiple GPUs (i.e., our initial study [27] used only one GPU); and (3) case studies with a communication-avoiding variant [13] of the Lanczos method [16], called CA-Lanczos, for solving a symmetric eigenvalue problem. The rest of the paper is organized as follows: First, in section 2, we discuss our implementations of several existing tall-skinny orthogonalization procedures, including CholQR, on a multicore CPU with multiple GPUs.…”
Section: Introduction
confidence: 86%
“…The main contributions of this paper, over our initial reports on the mixed-precision CholQR [27], are (1) theoretical analysis of the mixed-precision CholQR, deriving upper bounds on the orthogonality error and the condition number of the computed orthogonal matrix; and (2) numerical results to study the stability of the mixed-precision CholQR in practice. In addition, we complement our initial report with the following studies: (1) performance studies of the mixed-precision CholQR in the working 32-bit single precision with a GPU, where the higher precision is the 64-bit double precision and is supported by the hardware; (2) case studies with CA-GMRES on multiple GPUs (i.e., our initial study [27] used only one GPU); and (3) case studies with a communication-avoiding variant [13] of the Lanczos method [16], called CA-Lanczos, for solving a symmetric eigenvalue problem.…”
Section: Introduction
confidence: 99%
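As a concrete illustration of the mixed-precision CholQR discussed in these excerpts, here is a minimal NumPy/SciPy sketch: the Gram matrix and its Cholesky factor are formed in 64-bit double precision, while the tall-skinny input and the final triangular solve stay in the 32-bit working precision. It is only a prototype under those assumptions, not the authors' GPU implementation, and all function and variable names are illustrative.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def mixed_precision_cholqr(V32):
    """CholQR of a tall-skinny float32 matrix V, with the Gram matrix and its
    Cholesky factorization carried out in float64 (the higher precision).
    Returns Q (float32, orthonormal columns) and R (float32, upper triangular)
    such that V is approximately Q @ R.  Illustrative sketch only."""
    # Form the small Gram matrix in the higher (double) precision.
    V64 = V32.astype(np.float64)
    G = V64.T @ V64
    # Cholesky factorization of the Gram matrix, still in double precision.
    R64 = cholesky(G, lower=False)            # G = R^T R
    # Cast R back to the working precision and orthogonalize V by a
    # tall-skinny triangular solve: Q = V R^{-1}.
    R32 = R64.astype(np.float32)
    Q = solve_triangular(R32.T, V32.T, lower=True).T
    return Q, R32

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V = rng.standard_normal((100000, 16)).astype(np.float32)
    Q, R = mixed_precision_cholqr(V)
    print(np.linalg.norm(Q.T @ Q - np.eye(16, dtype=np.float32)))  # orthogonality error
    print(np.linalg.norm(Q @ R - V) / np.linalg.norm(V))           # factorization residual
```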
“…We also plan to study other partitioning algorithms (e.g., hypergraph partitioning), other orthogonalization strategies (e.g., rank-revealing QR with column pivoting [10] or the use of mixed-precision arithmetic [23]), and adaptive schemes to select or switch orthogonalization strategies or to adjust input parameters (e.g., m and s [23]). Finally, our performance results demonstrated that though MPK could obtain a speedup of up to 1.3× over SpMV, it can be slower due to the overheads traded for reducing the communication latency.…”
Section: Discussion
confidence: 99%
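For context on the terminology in this excerpt, MPK is the matrix powers kernel used in communication-avoiding Krylov methods: it produces the s-step Krylov basis v, Av, ..., Aˢv with fewer communication/synchronization steps than s independent sparse matrix-vector products (SpMV), at the price of some redundant work. The sketch below shows only the baseline SpMV loop that such a kernel replaces, using SciPy sparse matrices; the communication-avoiding blocking itself is not modeled and the names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def krylov_basis_spmv(A, v, s):
    """Generate the s-step Krylov basis [v, Av, ..., A^s v] with s separate
    sparse matrix-vector products.  A matrix powers kernel (MPK) computes the
    same basis but restructures the work to reduce communication, which is
    why it can be faster -- or slower, once its overheads dominate."""
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]   # one SpMV (and one synchronization) per step
    return V

if __name__ == "__main__":
    A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
    V = krylov_basis_spmv(A, np.ones(1000), s=5)
    print(V.shape)  # (1000, 6)
```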
“…Our code was compiled using the GNU gcc 4.4.6 compiler and the CUDA nvcc 4.2 compiler with the optimization flag -O3, and linked with MKL 2011 sp1.8.273. We are investigating other batched kernels (e.g., GEMV, SYRK, and GEQRF) and the potential of using an auto-tuner to improve the performance (see [23]). The performance of CholQR/SVQR also depends on the triangular solve on a tall-skinny matrix, where we use the MAGMA DTRSM that was developed for the Cholesky or LU factorization.…”
confidence: 99%
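The tall-skinny triangular solve mentioned in this excerpt is the Q = V R⁻¹ step of CholQR/SVQR, which maps onto a single right-sided BLAS TRSM. The following is a small CPU-side illustration using SciPy's BLAS wrapper rather than MAGMA's GPU DTRSM; the matrices are synthetic and the setup is assumed for the example.

```python
import numpy as np
from scipy.linalg.blas import dtrsm

# Tall-skinny V (m >> n) and an upper-triangular R, e.g. from a CholQR step.
rng = np.random.default_rng(0)
m, n = 100000, 16
V = rng.standard_normal((m, n))
R = np.triu(rng.standard_normal((n, n))) + n * np.eye(n)   # keep R well conditioned

# Right-sided TRSM: solve Q * R = V for Q, i.e. Q = V * R^{-1}.
# side=1 puts the triangular matrix on the right; lower=0 marks R as upper triangular.
Q = dtrsm(1.0, R, V, side=1, lower=0)

print(np.linalg.norm(Q @ R - V) / np.linalg.norm(V))  # should be near machine epsilon
```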
“…These routines are highly optimized in our GPU implementation, especially the GEMMs, which, due to the specific sizes of the matrices involved (tall-and-skinny matrices A and B with a small square resulting matrix AᵀB), required modifications to the standard GEMM algorithm for large matrices [53]. What worked very well was splitting the AᵀB GEMM into smaller GEMMs, based on tuning the MAGMA GEMM [53] for the particular small sizes, all grouped for execution into a single batched GEMM, followed by the addition of the local results [54].…”
Section: Runtime and Energy Analysis of LOBPCG
confidence: 97%
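The splitting strategy described in this excerpt can be mimicked in a few lines: the long dimension of the tall-skinny A and B is cut into chunks, the small chunk products AᵢᵀBᵢ are evaluated together (here with one 3-D np.matmul standing in for a batched GEMM), and the partial results are then summed. This is a CPU-side sketch of the idea, not the tuned MAGMA kernel; chunk counts and names are chosen for the example.

```python
import numpy as np

def split_gram_gemm(A, B, num_chunks):
    """Compute the small product A^T @ B for tall-skinny A (m x k) and B (m x n)
    by splitting the long dimension m into equal chunks, evaluating every chunk
    product A_i^T B_i with a single batched matmul, and summing the partial
    results -- mimicking the batched-GEMM-plus-reduction strategy."""
    m, k = A.shape
    n = B.shape[1]
    assert m % num_chunks == 0, "sketch assumes m splits evenly into chunks"
    c = m // num_chunks
    A_batch = A.reshape(num_chunks, c, k)              # chunk i = rows [i*c, (i+1)*c)
    B_batch = B.reshape(num_chunks, c, n)
    # One "batched GEMM": partial[i] = A_i^T @ B_i for every chunk i.
    partial = np.matmul(A_batch.transpose(0, 2, 1), B_batch)   # (num_chunks, k, n)
    # Reduction: add the local results to obtain the k x n matrix A^T B.
    return partial.sum(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8192, 16))
    B = rng.standard_normal((8192, 16))
    C = split_gram_gemm(A, B, num_chunks=8)
    print(np.linalg.norm(C - A.T @ B))   # agrees with the direct product up to round-off
```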