Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2010
DOI: 10.1145/1693453.1693471
Model-driven autotuning of sparse matrix-vector multiply on GPUs

Abstract: We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPUs). Our study consists of two parts. First, we describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants on classical blocked compressed sparse row (BCSR) and blocked ELLPACK (…


Cited by 221 publications (106 citation statements)
References 16 publications
“…Choi et al. [12] designed a blocked ELLPACK format and proposed a CUDA performance model to predict matrix-dependent tuning parameters. Xu et al. [13] proposed an optimized SpMV based on the ELL format and a CUDA performance model for SpMV.…”
Section: Related Work
confidence: 99%
“…Because of reduced off-chip memory access and better on-chip data locality, block-based formats and libraries, such as OSKI [38,42,43], pOSKI [37], CSB [44,45], BELLPACK [46], BCCOO/BCCOO+ [5], BRC [6] and RSB [47], have attracted the most attention. However, block-based formats depend heavily on the sparsity structure: the input matrix must exhibit a block structure that matches the intended block layout.…”
Section: Comparison To Related Methods
confidence: 99%
“…Their method is general but requires that global memory bandwidth not be the performance bottleneck. In [35], a new compressed format is proposed for sparse matrices on GPUs, and a search is needed to determine certain parameters of the format. They proposed an analytical model specific to SpMV to eliminate search candidates.…”
Section: Related Work
confidence: 99%