2018
DOI: 10.1137/17m1117732
The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale

Abstract: The computation of the singular value decomposition, or SVD, has a long history with many improvements over the years, both in its implementations and algorithmically. Here, we survey the evolution of SVD algorithms for dense matrices, discussing the motivation and performance impacts of changes. There are two main branches of dense SVD methods: bidiagonalization and Jacobi. Bidiagonalization methods started with the implementation by Golub and Reinsch in Algol60, which was subsequently ported to Fortran in th…

Cited by 81 publications (41 citation statements)
References 100 publications
“…From a computational point of view evaluating (5.1) is not expensive because B_1 is a diagonal matrix and m ≪ n, that is, Y is a tall, skinny matrix, and therefore σ_m(Y) can be computed very efficiently [17, Sec. 5.4]. Similarly the expression for µ_B in (3.20) can also be simplified, which gives…”
Section: Application to the Stochastic Galerkin Method
confidence: 99%
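Since the quoted argument turns on σ_m(Y) being cheap to compute when Y is tall and skinny, a minimal sketch may help. The QR-based reduction below is a standard device in the spirit of the cited [17, Sec. 5.4]; the shapes, variable names, and use of NumPy are assumptions for illustration, not the citing paper's actual expression (5.1).

import numpy as np

rng = np.random.default_rng(0)
n, m = 100_000, 20           # n >> m: Y is tall and skinny (assumed shapes)
Y = rng.standard_normal((n, m))

# Y = QR with Q having orthonormal columns implies Y^T Y = R^T R,
# so Y and the small m x m factor R share their singular values.
R = np.linalg.qr(Y, mode="r")                      # O(n m^2) work, R is m x m
sigma_m = np.linalg.svd(R, compute_uv=False)[-1]   # smallest singular value

# Reference check against the full, far more expensive computation.
assert np.isclose(sigma_m, np.linalg.svd(Y, compute_uv=False)[-1])
print(sigma_m)

The point is that all the O(n) work is confined to one tall-skinny QR factorization; the SVD itself runs on a tiny m × m matrix.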
“…EISPACK was designed to run on a single-core CPU and was replaced by LINPACK, which first implemented the SVD algorithm with a basic linear algebra subprograms (BLAS) interface. The performance of LINPACK was limited by its BLAS1 implementation and benefited little from multicore architectures. LAPACK redesigned the SVD algorithm to use BLAS3 routines wherever possible to improve performance on multicore CPUs.…”
Section: Related Work
confidence: 99%
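For context, a minimal sketch of what the LAPACK-era interface looks like from user code: SciPy exposes LAPACK's two dense SVD drivers, both built on the blocked, BLAS3-rich bidiagonalization the passage describes. The matrix shape here is an arbitrary assumption.

import numpy as np
from scipy.linalg import svd

A = np.random.default_rng(1).standard_normal((2000, 500))

# 'gesdd' is LAPACK's divide-and-conquer driver (SciPy's default);
# 'gesvd' is the classical QR-iteration driver. Both rely on BLAS3
# kernels, unlike the BLAS1-era LINPACK code discussed above.
U, s, Vt = svd(A, full_matrices=False, lapack_driver="gesdd")
s_ref = svd(A, compute_uv=False, lapack_driver="gesvd")
print(np.allclose(s, s_ref))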
“…This low efficiency is due to the computation of tall-skinny GEMM, which is closer to GEMV (a BLAS2 routine) than to GEMM (a BLAS3 routine). BLAS2 routines are less efficient than BLAS3 routines because their vector accesses degrade the cache hit rate on both multicore CPUs and GPUs; BLAS3 routines are 20 to 40 times more efficient than BLAS2 routines. Regarding the in-core performance of tall-skinny GEMM, Chen et al. achieved 1.1 to 3.0× speedups over cuBLAS for tall-skinny matrices with up to 16 columns.…”
Section: GPU-accelerated Out-of-core GEMM
confidence: 99%
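The BLAS2-versus-BLAS3 gap this quote cites is easy to observe directly. The sketch below, with assumed shapes and no warm-up runs, times a square GEMM against a 16-column tall-skinny GEMM in NumPy (which dispatches to the underlying BLAS); the measured ratio is machine-dependent and will not necessarily match the quoted 20 to 40×.

import time
import numpy as np

rng = np.random.default_rng(2)

def rate(m, n, k, seconds):
    """GEMM flop rate in GFLOP/s for C(m x n) = A(m x k) @ B(k x n)."""
    return 2.0 * m * n * k / seconds / 1e9

# Square GEMM: each element is reused many times, so BLAS3 runs near peak.
A = rng.standard_normal((2048, 2048))
B = rng.standard_normal((2048, 2048))
t0 = time.perf_counter()
_ = A @ B
t_square = time.perf_counter() - t0

# Tall-skinny GEMM with 16 columns: almost no data reuse per element
# loaded, so performance degrades toward memory-bound, GEMV-like speed.
C = rng.standard_normal((4_194_304, 16))
D = rng.standard_normal((16, 16))
t0 = time.perf_counter()
_ = C @ D
t_skinny = time.perf_counter() - t0

print(f"square GEMM      : {rate(2048, 2048, 2048, t_square):8.1f} GFLOP/s")
print(f"tall-skinny GEMM : {rate(4_194_304, 16, 16, t_skinny):8.1f} GFLOP/s")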