2014 IEEE 13th International Symposium on Parallel and Distributed Computing 2014
DOI: 10.1109/ispdc.2014.10

A GPU Framework for Sparse Matrix Vector Multiplication

Abstract: The hardware and software evolution of Graphics Processing Units (GPUs) for general-purpose computation has changed how parallel programming issues are addressed. Many applications are being ported to GPUs to achieve performance gains. GPU execution time is continuously optimized by GPU programmers, while optimizing pre-GPU computation overheads has attracted the research community in the recent past. While the GPU executes programs given by a CPU, pre-GPU computation overheads do …

Cited by 10 publications
(3 citation statements)
References 13 publications
“…We leveraged the GPU-accelerated capabilities of the Nvidia Xavier platform, employing PyCUDA [22,23] to develop kernels aimed at optimizing bottleneck operations. Our optimization strategies included the use of shared memory with padding to prevent bank conflicts [24], coalescing global memory accesses for increased throughput [25], pre-computation of constants to diminish runtime calculations [26], loop unrolling [27], and warp divergence minimization through conditional optimization [4]. We also selected faster arithmetic operations when the precision requirements permitted and minimized synchronization needs.…”
Section: Introduction
Mentioning confidence: 99%
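One of the strategies listed above, padding shared memory to prevent bank conflicts, can be illustrated without GPU hardware by simulating the bank-mapping arithmetic. The sketch below is our own illustration (not code from the cited work): it assumes the common 32-bank layout where a 4-byte word at linear index `addr` maps to bank `addr % 32`, and shows that reading one column of a row-major 32×32 tile hits a single bank (a 32-way conflict) unless the row is padded by one element.

```python
# Hypothetical illustration: why padding a shared-memory tile avoids bank
# conflicts on a GPU with 32 banks. Names and layout are our assumptions.
NUM_BANKS = 32
TILE = 32  # tile is TILE rows, each `width` words long, stored row-major

def banks_hit(width, col=0):
    """Set of banks touched when 32 threads each read one row of column `col`."""
    return {(row * width + col) % NUM_BANKS for row in range(TILE)}

# Unpadded row width 32: every thread lands in the same bank (serialized).
print(len(banks_hit(TILE)))      # 1 bank -> 32-way conflict
# Padded row width 33: accesses spread across all 32 banks (conflict-free).
print(len(banks_hit(TILE + 1)))  # 32 banks
```

The one-word padding wastes a column of shared memory per tile but lets a column read proceed in a single pass instead of 32 serialized ones.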
“…We leveraged the GPU-accelerated capabilities of the Nvidia Xavier platform, employing PyCUDA [22,23] to develop kernels aimed at optimizing bottleneck operations. Our optimization strategies included the use of shared memory with padding to prevent bank conflicts [24], coalescing global memory accesses for increased throughput [25], pre-computation of constants to diminish runtime calculations [26], loop unrolling [27], and warp divergence minimization through conditional optimization [4]. We also selected faster arithmetic operations when the precision requirements permitted and minimized synchronization needs.…”
Section: Introductionmentioning
confidence: 99%
“…Therefore, scholars have proposed tools such as the SpMV Auto-Tuner (SMAT) to select, from among various compression formats [9][10][11][12], the optimal format adapted to the hardware structure by analyzing the distribution characteristics of the non-zero elements in the sparse matrix. In addition, some research has proposed improving SpMV performance by accelerating the processor's computing speed, for example using a heterogeneous CPU+GPU parallel computing structure [13]. However, when the SpMV algorithm runs, memory accesses to the compressed sparse matrix elements are contiguous, while memory accesses to the elements of x are irregular.…”
Section: Introduction
Mentioning confidence: 99%
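The contiguous-versus-irregular access pattern described above is visible in even a minimal Compressed Sparse Row (CSR) SpMV kernel. The sketch below is our own illustration under standard CSR conventions (array names are ours, not from the cited papers): `values` and `col_idx` are scanned sequentially, but the gather from `x` jumps wherever the column indices point.

```python
# Minimal CSR sparse matrix-vector multiply (illustrative sketch).
def spmv_csr(row_ptr, col_idx, values, x):
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        # Contiguous, cache-friendly sweep over values/col_idx for row i...
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # ...but an irregular, data-dependent gather from x.
            y[i] += values[k] * x[col_idx[k]]
    return y

# 3x3 example matrix [[4,0,1],[0,2,0],[3,0,5]] in CSR form:
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values  = [4.0, 1.0, 2.0, 3.0, 5.0]
print(spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```

It is exactly the `x[col_idx[k]]` gather that format-selection tools like SMAT try to tame by choosing a compression layout matched to the matrix's non-zero distribution.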
“…Using shared memory can yield substantial performance gains compared with global memory access, which takes far more clock cycles. Readers are directed to some of the first author's work on various optimizations and the use of GPUs for scientific computation at .…”
Section: Introduction
Mentioning confidence: 99%