2014 IEEE 13th International Symposium on Parallel and Distributed Computing 2014
DOI: 10.1109/ispdc.2014.10

A GPU Framework for Sparse Matrix Vector Multiplication

Abstract: The hardware and software evolution of Graphics Processing Units (GPUs) for general-purpose computation has changed how parallel programming issues are addressed. Many applications are being ported to GPUs to achieve performance gains. GPU execution time is continuously optimized by GPU programmers, while optimizing pre-GPU computation overheads has attracted the research community in the recent past. While the GPU executes programs given by a CPU, pre-GPU computation overheads do …

Cited by 10 publications
(3 citation statements)
References 13 publications
“…We leveraged the GPU-accelerated capabilities of the Nvidia Xavier platform, employing PyCUDA [22,23] to develop kernels aimed at optimizing bottleneck operations. Our optimization strategies included the use of shared memory with padding to prevent bank conflicts [24], coalescing global memory accesses for increased throughput [25], pre-computation of constants to diminish runtime calculations [26], loop unrolling [27], and warp divergence minimization through conditional optimization [4]. We also selected faster arithmetic operations when the precision requirements permitted and minimized synchronization needs.…”
Section: Introduction
Mentioning confidence: 99%
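One of the strategies listed above, padding shared memory to prevent bank conflicts, can be illustrated without GPU hardware by simulating the bank-mapping arithmetic. The sketch below is our own illustration (not code from the cited work): it assumes the common 32-bank layout where a 4-byte word at linear index `addr` maps to bank `addr % 32`, and shows that reading one column of a row-major 32×32 tile hits a single bank (a 32-way conflict) unless the row is padded by one element.

```python
# Hypothetical illustration: why padding a shared-memory tile avoids bank
# conflicts on a GPU with 32 banks. Names and layout are our assumptions.
NUM_BANKS = 32
TILE = 32  # tile is TILE rows, each `width` words long, stored row-major

def banks_hit(width, col=0):
    """Set of banks touched when 32 threads each read one row of column `col`."""
    return {(row * width + col) % NUM_BANKS for row in range(TILE)}

# Unpadded row width 32: every thread lands in the same bank (serialized).
print(len(banks_hit(TILE)))      # 1 bank -> 32-way conflict
# Padded row width 33: accesses spread across all 32 banks (conflict-free).
print(len(banks_hit(TILE + 1)))  # 32 banks
```

The one-word padding wastes a column of shared memory per tile but lets a column read proceed in a single pass instead of 32 serialized ones.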
“…We leveraged the GPU-accelerated capabilities of the Nvidia Xavier platform, employing PyCUDA [22,23] to develop kernels aimed at optimizing bottleneck operations. Our optimization strategies included the use of shared memory with padding to prevent bank conflicts [24], coalescing global memory accesses for increased throughput [25], pre-computation of constants to diminish runtime calculations [26], loop unrolling [27], and warp divergence minimization through conditional optimization [4]. We also selected faster arithmetic operations when the precision requirements permitted and minimized synchronization needs.…”
Section: Introductionmentioning
confidence: 99%
“…Therefore, scholars have proposed tools such as the SpMV Auto-Tuner (SMAT) to select, from among various compression formats [9][10][11][12], the optimal format adapted to the hardware structure by analyzing the distribution characteristics of the non-zero elements in the sparse matrix. In addition, some research has proposed improving SpMV performance by accelerating the processor's computing speed, for example using a heterogeneous CPU+GPU parallel computing structure [13]. However, when the SpMV algorithm runs, memory accesses to the compressed sparse matrix elements are contiguous, while memory accesses to the elements of x are irregular.…”
Section: Introduction
Mentioning confidence: 99%
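The contiguous-versus-irregular access pattern described above is visible in even a minimal Compressed Sparse Row (CSR) SpMV kernel. The sketch below is our own illustration under standard CSR conventions (array names are ours, not from the cited papers): `values` and `col_idx` are scanned sequentially, but the gather from `x` jumps wherever the column indices point.

```python
# Minimal CSR sparse matrix-vector multiply (illustrative sketch).
def spmv_csr(row_ptr, col_idx, values, x):
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        # Contiguous, cache-friendly sweep over values/col_idx for row i...
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # ...but an irregular, data-dependent gather from x.
            y[i] += values[k] * x[col_idx[k]]
    return y

# 3x3 example matrix [[4,0,1],[0,2,0],[3,0,5]] in CSR form:
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values  = [4.0, 1.0, 2.0, 3.0, 5.0]
print(spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```

It is exactly the `x[col_idx[k]]` gather that format-selection tools like SMAT try to tame by choosing a compression layout matched to the matrix's non-zero distribution.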
“…Using shared memory can yield substantial performance gains compared with global memory access, which takes far more clock cycles. Readers are directed to some of the first author's work on various optimizations and the use of GPUs for scientific computation at .…”
Section: Introduction
Mentioning confidence: 99%