Abstract: The multiplication of a sparse matrix by a dense vector (SpMV) is a centerpiece of scientific computing applications: it is the essential kernel for the solution of sparse linear systems and sparse eigenvalue problems by iterative methods. The efficient implementation of the sparse matrix-vector multiplication is therefore crucial and has been the subject of an immense amount of research, with interest renewed with every major new trend in high performance computing architectures. The introduction of General Purpose…
“…A GPU dose engine is adopted to calculate the dose contribution matrix, and the matrix is then converted to the most memory-efficient sparse matrix format, that is, the compressed sparse row (CSR) format. The CSR format uses three arrays to store the nonzero elements, the corresponding column indices, and the compressed row offsets, which indicate the boundary of each row.…”
Section: Methods
confidence: 99%
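The three-array CSR layout described in the snippet above can be illustrated with a small sketch. This is a plain-Python illustration on a hypothetical 3×4 matrix, not the paper's actual dose contribution matrix:

```python
# Sketch of the CSR (compressed sparse row) layout from the snippet:
#   values  - the nonzero elements, stored row by row
#   cols    - the column index of each nonzero
#   row_ptr - offsets marking where each row begins/ends in `values`

dense = [
    [5.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 3.0, 0.0],
    [1.0, 0.0, 0.0, 4.0],
]

values, cols, row_ptr = [], [], [0]
for row in dense:
    for j, v in enumerate(row):
        if v != 0.0:
            values.append(v)
            cols.append(j)
    row_ptr.append(len(values))  # boundary of this row

print(values)   # [5.0, 2.0, 3.0, 1.0, 4.0]
print(cols)     # [0, 3, 2, 0, 3]
print(row_ptr)  # [0, 2, 3, 5]
```

Row `i`'s nonzeros live in `values[row_ptr[i]:row_ptr[i+1]]`, which is why only `n_rows + 1` row offsets are needed instead of one row index per nonzero.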
“…did not support GPU acceleration. Table 2 lists the run times of the three cases (the times of dose calculation are not included).…”
Section: Heterogeneous Platform With CPU and GPU
Robust optimization has been shown to be effective for stabilizing treatment planning in intensity modulated proton therapy (IMPT), but existing algorithms for the optimization process are time-consuming. This paper describes a fast robust optimization tool that takes advantage of GPU parallel computing technologies. The new robust optimization model is based on nine boundary dose distributions: two for ±range uncertainties, six for ±set-up uncertainties along the anteroposterior (A-P), lateral (R-L), and superior-inferior (S-I) directions, and one for the nominal situation. The nine boundary influence matrices were calculated using an in-house finite size pencil beam dose engine, while the conjugate gradient method was applied to minimize the objective function. The proton dose calculation algorithm and the conjugate gradient method were tuned for heterogeneous platforms involving the CPU host and GPU device. Three clinical cases, one head and neck cancer case, one lung cancer case, and one prostate cancer case, were investigated to demonstrate the clinical feasibility of the proposed robust optimizer. Compared with results from Varian Eclipse (version 13.3), the proposed method is found to be conducive to robust treatment planning that is less sensitive to range and setup uncertainties. The three tested cases show that targets can achieve high dose uniformity while organs at risk (OARs) are better protected against setup and range errors. On the CPU + GPU heterogeneous platform, the execution times of the head and neck cancer case and the prostate cancer case are much less than half of Eclipse's, while the run time of the lung cancer case is similar to that of Eclipse. The fast robust optimizer developed in this study can improve the reliability of traditional proton treatment planning at a much faster speed, making it feasible for clinical use.
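The abstract above minimizes its objective with the conjugate gradient method. A minimal sketch of the classic CG iteration for a symmetric positive-definite system follows; it is an illustration of the general algorithm, not the paper's GPU-tuned implementation, and the small test system is hypothetical:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=200):
    """Solve A x = b (equivalently, minimize 0.5*x^T A x - b^T x)
    for symmetric positive-definite A via conjugate gradients."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual = negative gradient of the objective
    p = r.copy()           # first search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)      # exact step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:  # converged
            break
        p = r + (rs_new / rs) * p  # next A-conjugate direction
        rs = rs_new
    return x

# Usage on a small SPD system (illustrative values)
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
```

Each iteration costs one matrix-vector product (`A @ p`), which is why fast SpMV, as discussed throughout this page, dominates the cost of CG on large sparse systems.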
“…We distinguish between multiplying a sparse matrix by a dense vector (SpMV) and by a sparse vector (SpMSpV). There is extensive literature focusing on SpMV for GPUs (including a comprehensive survey [22]). However, we concentrate on SpMSpV, because it is more relevant to graph search algorithms, where the vector represents the subset of vertices that are currently active and is typically sparse.…”
Section: Two Roads to Matrix-Vector Multiplication
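The SpMV/SpMSpV distinction drawn in the snippet above can be made concrete: when the input vector is sparse, only the matrix columns selected by the active entries need to be touched. The following sketch uses a simple dict-based column store; it is a hypothetical illustration, not the cited framework's actual data structures:

```python
# SpMSpV sketch: the input vector holds only the currently active
# vertices, so work is proportional to the nonzeros in the columns
# they select (a "push"-style traversal), not to the whole matrix.
# The matrix is stored column-wise as {col: [(row, value), ...]}.

def spmspv(cols, x_sparse):
    """cols: {j: [(i, a_ij), ...]}; x_sparse: {j: x_j} with x_j != 0."""
    y = {}
    for j, xj in x_sparse.items():       # iterate only over active entries
        for i, aij in cols.get(j, []):   # only the selected columns
            y[i] = y.get(i, 0.0) + aij * xj
    return y                             # output is sparse as well

# 3x3 matrix with nonzeros a[0][1]=2, a[2][1]=5, a[1][0]=7
cols = {1: [(0, 2.0), (2, 5.0)], 0: [(1, 7.0)]}
print(spmspv(cols, {1: 1.0}))  # only column 1 is visited -> {0: 2.0, 2: 5.0}
```

In a breadth-first search, `x_sparse` would be the current frontier and the returned `y` the next frontier, which is exactly why SpMSpV maps onto graph traversal better than dense-vector SpMV.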
High-performance implementations of graph algorithms are difficult to achieve on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address some of these challenges, GraphBLAS is an innovative, ongoing effort by the graph analytics community to propose building blocks based in sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable, and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push and pull direction. Exploiting output sparsity allows users to tell the backend which values of the output in a single vectorized computation they do not want computed. Load-balancing is an important feature for balancing work amongst parallel workers. We describe the important load-balancing features for handling graphs with different characteristics. The design principles described in this paper have been implemented in "GraphBLAST", the first open-source linear-algebra-based graph framework on GPU targeting high-performance computing. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over the previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and the shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.
“…For this reason, optimizing SpMV has been extensively studied by many researchers (see, e.g., other works). Two recent works provide a detailed survey…”
Section: Related Work
confidence: 99%
“…However, it is known that SpMV's performance falls well behind the capacity of modern computers. Hence, optimization of SpMV has been extensively studied (see the works of Langr and Tvrdik, and Filippone et al. for comprehensive surveys).…”
Summary
Sparse matrix-vector multiplication (SpMV) is a crucial operation used for solving many engineering and scientific problems. In general, there is no single SpMV method that gives high performance for all sparse matrices. Even though there exist sparse matrix storage formats and SpMV implementations that yield high efficiency for certain matrix structures, using these methods may entail high preprocessing or format conversion costs. In this work, we present a new SpMV implementation, named CSRLenGoto, that can be utilized by preprocessing the Compressed Sparse Row (CSR) format of a matrix. This preprocessing phase is inexpensive enough for the associated cost to be compensated in just a few repetitions of the SpMV operation. CSRLenGoto is based on complete loop unrolling and gives performance improvements in particular for matrices whose mean row length is low. We parallelized our method by integrating it into a state-of-the-art matrix partitioning approach as the kernel operation. We observed up to 2.46× and on average 1.29× speedup with respect to Intel MKL's SpMV function for matrices with short or medium-length rows.
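The baseline that such unrolling schemes start from is the textbook CSR SpMV loop. In the sketch below (illustrative only, not CSRLenGoto itself), note that for a matrix with a low mean row length the inner loop runs only a handful of iterations per row, so per-row loop overhead becomes significant; that is the cost complete unrolling targets:

```python
# Textbook CSR SpMV: y = A @ x using the three CSR arrays.
# For short rows, the inner loop does only a few iterations per row,
# which is the per-row overhead that unrolling schemes aim to remove.

def csr_spmv(values, cols, row_ptr, x):
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):  # nonzeros of row i
            acc += values[k] * x[cols[k]]
        y[i] = acc
    return y

# A small 3x4 example matrix in CSR form (hypothetical values)
values = [5.0, 2.0, 3.0, 1.0, 4.0]
cols = [0, 3, 2, 0, 3]
row_ptr = [0, 2, 3, 5]
print(csr_spmv(values, cols, row_ptr, [1.0, 1.0, 1.0, 1.0]))  # [7.0, 3.0, 5.0]
```

The indirect access `x[cols[k]]` is also why SpMV is memory-bound: each nonzero triggers a potentially irregular load from `x`, giving the low arithmetic intensity that the surveys cited on this page discuss.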