2020
DOI: 10.1002/cpe.5754
Reducing the amount of out‐of‐core data access for GPU‐accelerated randomized SVD

Abstract: We propose two acceleration methods, namely, Fused and Gram, for reducing out‐of‐core data access when performing randomized singular value decomposition (RSVD) on graphics processing units (GPUs). Out‐of‐core data here are data that are too large to fit into the GPU memory at once. Both methods accelerate GPU‐enabled RSVD using the following three schemes: (1) a highly tuned general matrix‐matrix multiplication (GEMM) scheme for processing out‐of‐core data on GPUs; (2) a data‐access reduction scheme b…
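Scheme (1) in the abstract refers to tiled GEMM over matrices too large for device memory. As an illustration only (not the paper's implementation), here is a minimal CPU-side sketch of the blocked-GEMM idea, with NumPy slices standing in for tiles that would be staged into GPU memory; the function name and block size are assumptions for this sketch.

```python
import numpy as np

def blocked_gemm(A, B, block=2048):
    """Compute C = A @ B one tile at a time, so that only small
    blocks of A, B, and C need to be resident at once (a stand-in
    for staging out-of-core tiles into limited GPU memory)."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, block):          # row panel of A and C
        for j in range(0, n, block):      # column panel of B and C
            for p in range(0, k, block):  # reduction dimension
                C[i:i+block, j:j+block] += (
                    A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
                )
    return C
```

In a GPU setting each tile multiply would be a cuBLAS GEMM call overlapped with host-to-device transfers; the loop structure above only shows why the working set per step stays bounded by the block size.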

Cited by 10 publications (3 citation statements)
References 55 publications
“…Randomized algorithms typically use nonuniform sampling to select a certain set of row and column vectors from the target matrix, which can achieve an important sampling selection with lower overhead and higher accuracy compared with that of the uniform sampling method. Coupled with large data matrix partition schemes and a partial (or truncated) SVD of a small matrix, randomized SVD algorithms can be implemented in parallel on graphics processing units (GPUs) with the capability of fast matrix multiplications and random number generations to achieve further acceleration [61], [62]. Nevertheless, the computational bottleneck restricting real-time performance still exists in the CPU-GPU transfer bandwidth and vector summation [61], [62] inherent in RPCA-based video decomposition.…”
Section: RPCA-based Foreground/Background Separation
confidence: 99%
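The statement above describes the standard randomized SVD pipeline: project onto a random subspace, then take a truncated SVD of the resulting small matrix. As a hedged illustration (NumPy on the CPU; in the cited setting the large GEMMs would run on the GPU), a minimal sketch of that pipeline, with the function name and oversampling parameter chosen for this example:

```python
import numpy as np

def randomized_svd(A, rank, oversample=10, seed=0):
    """Basic randomized SVD sketch: sample the range of A with a
    Gaussian test matrix, orthonormalize, then compute the exact
    SVD of the small projected matrix."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Gaussian test matrix; generating it and the two large GEMMs
    # below are the GPU-friendly steps in the cited work.
    Omega = rng.standard_normal((n, rank + oversample))
    Y = A @ Omega                       # sample the range of A
    Q, _ = np.linalg.qr(Y)              # orthonormal basis for the range
    B = Q.T @ A                         # small (rank+p) x n matrix
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                          # lift factors back to R^m
    return U[:, :rank], S[:rank], Vt[:rank]
```

For an exactly low-rank input the projected subspace captures the full range, so the truncated factors reconstruct the matrix to near machine precision; for general matrices the oversampling parameter trades accuracy against cost.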
“…Coupled with large data matrix partition schemes and a partial (or truncated) SVD of a small matrix, randomized SVD algorithms can be implemented in parallel on graphics processing units (GPUs) with the capability of fast matrix multiplications and random number generations to achieve further acceleration [61], [62]. Nevertheless, the computational bottleneck restricting real-time performance still exists in the CPU-GPU transfer bandwidth and vector summation [61], [62] inherent in RPCA-based video decomposition.…”
Section: RPCA-based Foreground/Background Separation
confidence: 99%
“…GPUs are more powerful accelerator devices than manycore CPUs for computing- and memory-intensive applications [13]-[16]. CUDA [17] is a parallel computing platform based on C++ which can be used to access the instruction set and computational elements on Nvidia GPUs.…”
Section: Introduction
confidence: 99%