2015
DOI: 10.1007/978-3-319-17353-5_2
Mixed-Precision Orthogonalization Scheme and Adaptive Step Size for Improving the Stability and Performance of CA-GMRES on GPUs

Cited by 14 publications (15 citation statements)
References 5 publications
“…In addition, we complement our initial report with the following studies: (1) performance studies of the mixed-precision CholQR in the working 32-bit single precision with a GPU, where the higher precision is the 64-bit double precision and is supported by the hardware; (2) case studies with CA-GMRES on multiple GPUs (i.e., our initial study [27] used only one GPU); and (3) case studies with a communication-avoiding variant [13] of the Lanczos method [16], called CA-Lanczos, for solving a symmetric eigenvalue problem. The rest of the paper is organized as follows: First, in section 2, we discuss our implementations of several existing tall-skinny orthogonalization procedures, including CholQR, on a multicore CPU with multiple GPUs.…”
Section: Introduction
confidence: 86%
“…The main contributions of this paper, over our initial reports on the mixed-precision CholQR [27], are (1) theoretical analysis of the mixed-precision CholQR, deriving upper bounds on the orthogonality error and the condition number of the computed orthogonal matrix; and (2) numerical results to study the stability of the mixed-precision CholQR in practice. In addition, we complement our initial report with the following studies: (1) performance studies of the mixed-precision CholQR in the working 32-bit single precision with a GPU, where the higher precision is the 64-bit double precision and is supported by the hardware; (2) case studies with CA-GMRES on multiple GPUs (i.e., our initial study [27] used only one GPU); and (3) case studies with a communication-avoiding variant [13] of the Lanczos method [16], called CA-Lanczos, for solving a symmetric eigenvalue problem.…”
Section: Introduction
confidence: 99%
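As a concrete illustration of the mixed-precision CholQR discussed in these excerpts, here is a minimal NumPy/SciPy sketch: the Gram matrix and its Cholesky factor are formed in 64-bit double precision, while the tall-skinny input and the final triangular solve stay in the 32-bit working precision. It is only a prototype under those assumptions, not the authors' GPU implementation, and all function and variable names are illustrative.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def mixed_precision_cholqr(V32):
    """CholQR of a tall-skinny float32 matrix V, with the Gram matrix and its
    Cholesky factorization carried out in float64 (the higher precision).
    Returns Q (float32, orthonormal columns) and R (float32, upper triangular)
    such that V is approximately Q @ R.  Illustrative sketch only."""
    # Form the small Gram matrix in the higher (double) precision.
    V64 = V32.astype(np.float64)
    G = V64.T @ V64
    # Cholesky factorization of the Gram matrix, still in double precision.
    R64 = cholesky(G, lower=False)            # G = R^T R
    # Cast R back to the working precision and orthogonalize V by a
    # tall-skinny triangular solve: Q = V R^{-1}.
    R32 = R64.astype(np.float32)
    Q = solve_triangular(R32.T, V32.T, lower=True).T
    return Q, R32

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V = rng.standard_normal((100000, 16)).astype(np.float32)
    Q, R = mixed_precision_cholqr(V)
    print(np.linalg.norm(Q.T @ Q - np.eye(16, dtype=np.float32)))  # orthogonality error
    print(np.linalg.norm(Q @ R - V) / np.linalg.norm(V))           # factorization residual
```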
“…We also plan to study other partitioning algorithms (e.g., hypergraph partitioning), other orthogonalization strategies (e.g., rank-revealing QR with column pivoting [10] or the use of mixed-precision arithmetic [23]), and adaptive schemes to select or switch orthogonalization strategies or to adjust input parameters (e.g., m and s [23]). Finally, our performance results demonstrated that though MPK could obtain a speedup of up to 1.3× over SpMV, it can be slower due to the overheads traded for reducing the communication latency.…”
Section: Discussion
confidence: 99%
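For context on the terminology in this excerpt, MPK is the matrix powers kernel used in communication-avoiding Krylov methods: it produces the s-step Krylov basis v, Av, ..., Aˢv with fewer communication/synchronization steps than s independent sparse matrix-vector products (SpMV), at the price of some redundant work. The sketch below shows only the baseline SpMV loop that such a kernel replaces, using SciPy sparse matrices; the communication-avoiding blocking itself is not modeled and the names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def krylov_basis_spmv(A, v, s):
    """Generate the s-step Krylov basis [v, Av, ..., A^s v] with s separate
    sparse matrix-vector products.  A matrix powers kernel (MPK) computes the
    same basis but restructures the work to reduce communication, which is
    why it can be faster -- or slower, once its overheads dominate."""
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]   # one SpMV (and one synchronization) per step
    return V

if __name__ == "__main__":
    A = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
    V = krylov_basis_spmv(A, np.ones(1000), s=5)
    print(V.shape)  # (1000, 6)
```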
“…Our code was compiled using the GNU gcc 4.4.6 compiler and the CUDA nvcc 4.2 compiler with the optimization flag -O3, and linked with MKL 2011 sp1.8.273. We are investigating other batched kernels (e.g., GEMV, SYRK, and GEQRF) and the potential of using an auto-tuner to improve the performance (see [23]). The performance of CholQR/SVQR also depends on the triangular solve on a tall-skinny matrix, where we use the MAGMA DTRSM that was developed for the Cholesky or LU factorization.…”
confidence: 99%
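The tall-skinny triangular solve mentioned in this excerpt is the Q = V R⁻¹ step of CholQR/SVQR, which maps onto a single right-sided BLAS TRSM. The following is a small CPU-side illustration using SciPy's BLAS wrapper rather than MAGMA's GPU DTRSM; the matrices are synthetic and the setup is assumed for the example.

```python
import numpy as np
from scipy.linalg.blas import dtrsm

# Tall-skinny V (m >> n) and an upper-triangular R, e.g. from a CholQR step.
rng = np.random.default_rng(0)
m, n = 100000, 16
V = rng.standard_normal((m, n))
R = np.triu(rng.standard_normal((n, n))) + n * np.eye(n)   # keep R well conditioned

# Right-sided TRSM: solve Q * R = V for Q, i.e. Q = V * R^{-1}.
# side=1 puts the triangular matrix on the right; lower=0 marks R as upper triangular.
Q = dtrsm(1.0, R, V, side=1, lower=0)

print(np.linalg.norm(Q @ R - V) / np.linalg.norm(V))  # should be near machine epsilon
```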
“…These routines are highly optimized in our GPU implementation, especially the GEMMs, which, due to the specific sizes of the matrices involved (tall-and-skinny matrices A and B with a small square resulting matrix AᵀB), required modifications to the standard GEMM algorithm for large matrices [53]. What worked very well was splitting the AᵀB GEMM into smaller GEMMs, based on tuning the MAGMA GEMM [53] for the particular small sizes, all grouped for execution into a single batched GEMM, followed by the addition of the local results [54].…”
Section: Runtime and Energy Analysis of LOBPCG
confidence: 97%
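The splitting strategy described in this excerpt can be mimicked in a few lines: the long dimension of the tall-skinny A and B is cut into chunks, the small chunk products AᵢᵀBᵢ are evaluated together (here with one 3-D np.matmul standing in for a batched GEMM), and the partial results are then summed. This is a CPU-side sketch of the idea, not the tuned MAGMA kernel; chunk counts and names are chosen for the example.

```python
import numpy as np

def split_gram_gemm(A, B, num_chunks):
    """Compute the small product A^T @ B for tall-skinny A (m x k) and B (m x n)
    by splitting the long dimension m into equal chunks, evaluating every chunk
    product A_i^T B_i with a single batched matmul, and summing the partial
    results -- mimicking the batched-GEMM-plus-reduction strategy."""
    m, k = A.shape
    n = B.shape[1]
    assert m % num_chunks == 0, "sketch assumes m splits evenly into chunks"
    c = m // num_chunks
    A_batch = A.reshape(num_chunks, c, k)              # chunk i = rows [i*c, (i+1)*c)
    B_batch = B.reshape(num_chunks, c, n)
    # One "batched GEMM": partial[i] = A_i^T @ B_i for every chunk i.
    partial = np.matmul(A_batch.transpose(0, 2, 1), B_batch)   # (num_chunks, k, n)
    # Reduction: add the local results to obtain the k x n matrix A^T B.
    return partial.sum(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8192, 16))
    B = rng.standard_normal((8192, 16))
    C = split_gram_gemm(A, B, num_chunks=8)
    print(np.linalg.norm(C - A.T @ B))   # agrees with the direct product up to round-off
```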