2020
DOI: 10.1002/cpe.5754
Reducing the amount of out‐of‐core data access for GPU‐accelerated randomized SVD

Abstract: We propose two acceleration methods, namely, Fused and Gram, for reducing out‐of‐core data access when performing randomized singular value decomposition (RSVD) on graphics processing units (GPUs). Out‐of‐core data here are data that are too large to fit into the GPU memory at once. Both methods accelerate GPU‐enabled RSVD using the following three schemes: (1) a highly tuned general matrix‐matrix multiplication (GEMM) scheme for processing out‐of‐core data on GPUs; (2) a data‐access reduction scheme b…
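Scheme (1) in the abstract refers to tiled GEMM over matrices too large for device memory. As an illustration only (not the paper's implementation), here is a minimal CPU-side sketch of the blocked-GEMM idea, with NumPy slices standing in for tiles that would be staged into GPU memory; the function name and block size are assumptions for this sketch.

```python
import numpy as np

def blocked_gemm(A, B, block=2048):
    """Compute C = A @ B one tile at a time, so that only small
    blocks of A, B, and C need to be resident at once (a stand-in
    for staging out-of-core tiles into limited GPU memory)."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, block):          # row panel of A and C
        for j in range(0, n, block):      # column panel of B and C
            for p in range(0, k, block):  # reduction dimension
                C[i:i+block, j:j+block] += (
                    A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
                )
    return C
```

In a GPU setting each tile multiply would be a cuBLAS GEMM call overlapped with host-to-device transfers; the loop structure above only shows why the working set per step stays bounded by the block size.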

Cited by 10 publications (3 citation statements)
References 55 publications
“…Randomized algorithms typically use nonuniform sampling to select a certain set of row and column vectors from the target matrix, which can achieve an important sampling selection with lower overhead and higher accuracy compared with that of the uniform sampling method. Coupled with large data matrix partition schemes and a partial (or truncated) SVD of a small matrix, randomized SVD algorithms can be implemented in parallel on graphics processing units (GPUs) with the capability of fast matrix multiplications and random number generations to achieve further acceleration [61], [62]. Nevertheless, the computational bottleneck restricting real-time performance still exists in the CPU-GPU transfer bandwidth and vector summation [61], [62] inherent in RPCA-based video decomposition.…”
Section: RPCA-based Foreground/Background Separation
confidence: 99%
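The statement above describes the standard randomized SVD pipeline: project onto a random subspace, then take a truncated SVD of the resulting small matrix. As a hedged illustration (NumPy on the CPU; in the cited setting the large GEMMs would run on the GPU), a minimal sketch of that pipeline, with the function name and oversampling parameter chosen for this example:

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, seed=0):
    """Basic randomized SVD sketch: sample the range of A with a
    Gaussian test matrix, orthonormalize, then compute the exact
    SVD of the small projected matrix."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Gaussian test matrix; generating it and the two large GEMMs
    # below are the GPU-friendly steps in the cited work.
    Omega = rng.standard_normal((n, rank + oversample))
    Y = A @ Omega                       # sample the range of A
    Q, _ = np.linalg.qr(Y)              # orthonormal basis for the range
    B = Q.T @ A                         # small (rank+p) x n matrix
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                          # lift factors back to R^m
    return U[:, :rank], S[:rank], Vt[:rank]
```

For an exactly low-rank input the projected subspace captures the full range, so the truncated factors reconstruct the matrix to near machine precision; for general matrices the oversampling parameter trades accuracy against cost.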
“…Coupled with large data matrix partition schemes and a partial (or truncated) SVD of a small matrix, randomized SVD algorithms can be implemented in parallel on graphics processing units (GPUs) with the capability of fast matrix multiplications and random number generations to achieve further acceleration [61], [62]. Nevertheless, the computational bottleneck restricting real-time performance still exists in the CPU-GPU transfer bandwidth and vector summation [61], [62] inherent in RPCA-based video decomposition.…”
Section: RPCA-based Foreground/Background Separation
confidence: 99%
“…GPUs are more powerful accelerator devices than manycore CPUs for computing- and memory-intensive applications [13]-[16]. CUDA [17] is a parallel computing platform based on C++ which can be used to access the instruction set and computational elements on Nvidia GPUs.…”
Section: Introduction
confidence: 99%