Abstract: The multiplication of a sparse matrix by a dense vector (SpMV) is a centerpiece of scientific computing applications: it is the essential kernel for the solution of sparse linear systems and sparse eigenvalue problems by iterative methods. The efficient implementation of the sparse matrix-vector multiplication is therefore crucial and has been the subject of an immense amount of research, with interest renewed with every major new trend in high performance computing architectures. The introduction of General Purpose…
“…A GPU dose engine is adopted to calculate the dose contribution matrix, and the matrix is then converted to the most memory-efficient sparse matrix format, that is, the compressed sparse row (CSR) format. The CSR format uses three arrays to store the nonzero elements, the corresponding column indices, and the compressed row offsets, which indicate the boundary of each row.…”
Section: Methods
confidence: 99%
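The three-array CSR layout described in the snippet above can be illustrated with a small sketch. This is a plain-Python illustration on a hypothetical 3×4 matrix, not the paper's actual dose contribution matrix:

```python
# Sketch of the CSR (compressed sparse row) layout from the snippet:
#   values  - the nonzero elements, stored row by row
#   cols    - the column index of each nonzero
#   row_ptr - offsets marking where each row begins/ends in `values`

dense = [
    [5.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 3.0, 0.0],
    [1.0, 0.0, 0.0, 4.0],
]

values, cols, row_ptr = [], [], [0]
for row in dense:
    for j, v in enumerate(row):
        if v != 0.0:
            values.append(v)
            cols.append(j)
    row_ptr.append(len(values))  # boundary of this row

print(values)   # [5.0, 2.0, 3.0, 1.0, 4.0]
print(cols)     # [0, 3, 2, 0, 3]
print(row_ptr)  # [0, 2, 3, 5]
```

Row `i`'s nonzeros live in `values[row_ptr[i]:row_ptr[i+1]]`, which is why only `n_rows + 1` row offsets are needed instead of one row index per nonzero.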
“…did not support GPU acceleration. Table 2 lists the run times of the three cases (the times of dose calculation are not included).…”
Section: Heterogeneous Platform With CPU and GPU
Robust optimization has been shown to be effective for stabilizing treatment planning in intensity modulated proton therapy (IMPT), but existing algorithms for the optimization process are time-consuming. This paper describes a fast robust optimization tool that takes advantage of GPU parallel computing technologies. The new robust optimization model is based on nine boundary dose distributions: two for ±range uncertainties, six for ±set-up uncertainties along the anteroposterior (A-P), lateral (R-L), and superior-inferior (S-I) directions, and one for the nominal situation. The nine boundary influence matrices were calculated using an in-house finite size pencil beam dose engine, while the conjugate gradient method was applied to minimize the objective function. The proton dose calculation algorithm and the conjugate gradient method were tuned for heterogeneous platforms involving the CPU host and GPU device. Three clinical cases, one head and neck cancer case, one lung cancer case, and one prostate cancer case, were investigated to demonstrate the clinical feasibility of the proposed robust optimizer. Compared with results from Varian Eclipse (version 13.3), the proposed method is found to be conducive to robust treatment planning that is less sensitive to range and setup uncertainties. The three tested cases show that targets can achieve high dose uniformity while organs at risk (OARs) are better protected against setup and range errors. On the CPU + GPU heterogeneous platform, the execution times of the head and neck cancer case and the prostate cancer case are much less than half of Eclipse's, while the run time of the lung cancer case is similar to that of Eclipse. The fast robust optimizer developed in this study can improve the reliability of traditional proton treatment planning at a much faster speed, making it feasible for clinical use.
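The abstract above minimizes its objective with the conjugate gradient method. A minimal sketch of the classic CG iteration for a symmetric positive-definite system follows; it is an illustration of the general algorithm, not the paper's GPU-tuned implementation, and the small test system is hypothetical:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=200):
    """Solve A x = b (equivalently, minimize 0.5*x^T A x - b^T x)
    for symmetric positive-definite A via conjugate gradients."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual = negative gradient of the objective
    p = r.copy()           # first search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)      # exact step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:  # converged
            break
        p = r + (rs_new / rs) * p  # next A-conjugate direction
        rs = rs_new
    return x

# Usage on a small SPD system (illustrative values)
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
```

Each iteration costs one matrix-vector product (`A @ p`), which is why fast SpMV, as discussed throughout this page, dominates the cost of CG on large sparse systems.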
“…We distinguish between multiplying a sparse matrix by a dense vector (SpMV) and by a sparse vector (SpMSpV). There is extensive literature focusing on SpMV for GPUs (including a comprehensive survey [22]). However, we concentrate on SpMSpV, because it is more relevant to graph search algorithms, where the vector represents the subset of vertices that are currently active and is typically sparse.…”
Section: Two Roads to Matrix-Vector Multiplication
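The SpMV/SpMSpV distinction drawn in the snippet above can be made concrete: when the input vector is sparse, only the matrix columns selected by the active entries need to be touched. The following sketch uses a simple dict-based column store; it is a hypothetical illustration, not the cited framework's actual data structures:

```python
# SpMSpV sketch: the input vector holds only the currently active
# vertices, so work is proportional to the nonzeros in the columns
# they select (a "push"-style traversal), not to the whole matrix.
# The matrix is stored column-wise as {col: [(row, value), ...]}.

def spmspv(cols, x_sparse):
    """cols: {j: [(i, a_ij), ...]}; x_sparse: {j: x_j} with x_j != 0."""
    y = {}
    for j, xj in x_sparse.items():       # iterate only over active entries
        for i, aij in cols.get(j, []):   # only the selected columns
            y[i] = y.get(i, 0.0) + aij * xj
    return y                             # output is sparse as well

# 3x3 matrix with nonzeros a[0][1]=2, a[2][1]=5, a[1][0]=7
cols = {1: [(0, 2.0), (2, 5.0)], 0: [(1, 7.0)]}
print(spmspv(cols, {1: 1.0}))  # only column 1 is visited -> {0: 2.0, 2: 5.0}
```

In a breadth-first search, `x_sparse` would be the current frontier and the returned `y` the next frontier, which is exactly why SpMSpV maps onto graph traversal better than dense-vector SpMV.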
High-performance implementations of graph algorithms are difficult to achieve on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, ongoing effort by the graph analytics community to propose building blocks based in sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable, and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in "GraphBLAST", the first open-source linear-algebra-based graph framework on GPU targeting high-performance computing. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over the previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and the shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.
“…For this reason, optimizing SpMV has been extensively studied by many researchers (see, e.g., other works). Two recent works provide a detailed survey…”
Section: Related Work
confidence: 99%
“…However, it is known that SpMV's performance falls well behind the capacity of modern computers. Hence, optimization of SpMV has been extensively studied (see the works of Langr and Tvrdik, and Filippone et al. for comprehensive surveys).…”
Summary
Sparse matrix-vector multiplication (SpMV) is a crucial operation used for solving many engineering and scientific problems. In general, there is no single SpMV method that gives high performance for all sparse matrices. Even though there exist sparse matrix storage formats and SpMV implementations that yield high efficiency for certain matrix structures, using these methods may entail high preprocessing or format conversion costs. In this work, we present a new SpMV implementation, named CSRLenGoto, that can be utilized by preprocessing the Compressed Sparse Row (CSR) format of a matrix. This preprocessing phase is inexpensive enough for the associated cost to be compensated in just a few repetitions of the SpMV operation. CSRLenGoto is based on complete loop unrolling and gives performance improvements in particular for matrices whose mean row length is low. We parallelized our method by integrating it into a state-of-the-art matrix partitioning approach as the kernel operation. We observed up to 2.46× and on average 1.29× speedup with respect to Intel MKL's SpMV function for matrices with short or medium-length rows.
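The baseline that such unrolling schemes start from is the textbook CSR SpMV loop. In the sketch below (illustrative only, not CSRLenGoto itself), note that for a matrix with a low mean row length the inner loop runs only a handful of iterations per row, so per-row loop overhead becomes significant; that is the cost complete unrolling targets:

```python
# Textbook CSR SpMV: y = A @ x using the three CSR arrays.
# For short rows, the inner loop does only a few iterations per row,
# which is the per-row overhead that unrolling schemes aim to remove.

def csr_spmv(values, cols, row_ptr, x):
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):  # nonzeros of row i
            acc += values[k] * x[cols[k]]
        y[i] = acc
    return y

# A small 3x4 example matrix in CSR form (hypothetical values)
values = [5.0, 2.0, 3.0, 1.0, 4.0]
cols = [0, 3, 2, 0, 3]
row_ptr = [0, 2, 3, 5]
print(csr_spmv(values, cols, row_ptr, [1.0, 1.0, 1.0, 1.0]))  # [7.0, 3.0, 5.0]
```

The indirect access `x[cols[k]]` is also why SpMV is memory-bound: each nonzero triggers a potentially irregular load from `x`, giving the low arithmetic intensity that the surveys cited on this page discuss.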