2017 46th International Conference on Parallel Processing (ICPP)
DOI: 10.1109/icpp.2017.19
High-Performance and Memory-Saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU

Cited by 60 publications (84 citation statements). References 17 publications.

“…In its marketing materials, cuSPARSE claims a 2-5× speedup over CPU competitors, and the raw computational and memory throughput of a GPU has a similar multiple over the CPU, so we believe this kernel represents the most significant opportunity to improve GPU performance. Recent GPU library implementations, including bhSPARSE [7], nsparse [10], and RMerge2 [3], have demonstrated significant speedups over cuSPARSE, and may be well-suited for the matrix operations we require in this challenge. cuSPARSE has the unenviable task of running effectively on any sparse matrix and thus its developers may have concentrated more on generality than performance.…”
Section: Discussion
confidence: 99%
“…In the beginning, we resorted to the standard kernel available in the cusparse [23] library (cusparseDcsrmm). However, we found that its performance was far from being optimal and we changed our code to use Nsparse, a recent implementation of sparse matrix-matrix product available in open source format [24]. Nsparse, as the implementation of Suitor, relies on the legacy shuffle primitives, nevertheless it provides a clear advantage with respect to the general-purpose primitives available in cusparse.…”
Section: Setup of the Preconditioner
confidence: 99%
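The "legacy shuffle primitives" mentioned in the excerpt are CUDA's warp-level data-exchange intrinsics (__shfl and friends), which since CUDA 9 have been superseded by the _sync variants. As a point of reference only, here is a minimal sketch of the warp-shuffle reduction pattern such kernels build on; the kernel and its names are illustrative, not code from Nsparse or cuSPARSE.

#include <cstdio>

__device__ float warpReduceSum(float v) {
    // Tree reduction across the 32 lanes of a warp; the legacy form of
    // each step would be __shfl_down(v, offset), without the mask.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;  // lane 0 ends up holding the warp-wide sum
}

__global__ void warpSumKernel(const float* x, int n, float* out) {
    float v = (threadIdx.x < n) ? x[threadIdx.x] : 0.0f;
    v = warpReduceSum(v);
    if (threadIdx.x == 0) *out = v;
}

int main() {
    const int n = 32;
    float h[n], result = 0.0f;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    float *dx, *dout;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dout, sizeof(float));
    cudaMemcpy(dx, h, n * sizeof(float), cudaMemcpyHostToDevice);
    warpSumKernel<<<1, 32>>>(dx, n, dout);
    cudaMemcpy(&result, dout, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %f\n", result);  // expected: 32.0
    cudaFree(dx);
    cudaFree(dout);
    return 0;
}

Register-to-register exchange of this kind avoids a round trip through shared memory, which is the advantage the excerpt credits such kernels with over general-purpose primitives.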
“…First, we show light-weight thread scheduling scheme with load-balancing for SpGEMM. Next, we show the optimization schemes for hash table based SpGEMM, which is proposed for GPU [25], and heap based shared-memory SpGEMM algorithms [3]. Additionally, we extend the Hash SpGEMM with utilizing vector registers of Intel Xeon or Xeon Phi.…”
Section: Architecture-Specific Optimization of SpGEMM
confidence: 99%
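The excerpt does not reproduce the scheduling scheme itself, but the load-balancing idea common to this line of work (including the cited GPU paper [25]) is to estimate each output row's work as its number of intermediate products and group rows into bins handled by appropriately sized thread teams. The sketch below illustrates only that binning step; binRowsByWork, the bin thresholds, and the toy matrices are assumptions for illustration.

#include <cstdio>
#include <vector>

// For C = A * B in CSR format, the number of intermediate products of
// row i is the sum of nnz(B row k) over the nonzeros a_ik. Binning rows
// by this count lets each bin be handled by a kernel (or thread team)
// sized for its load. Thresholds below are placeholders.
std::vector<int> binRowsByWork(const std::vector<int>& aRowPtr,
                               const std::vector<int>& aColIdx,
                               const std::vector<int>& bRowPtr) {
    int m = (int)aRowPtr.size() - 1;
    std::vector<int> bin(m);
    for (int i = 0; i < m; ++i) {
        long flops = 0;
        for (int j = aRowPtr[i]; j < aRowPtr[i + 1]; ++j) {
            int k = aColIdx[j];
            flops += bRowPtr[k + 1] - bRowPtr[k];  // nnz of B's row k
        }
        // Bin 0: fits one warp; bin 1: one block; bin 2: heavy rows.
        bin[i] = (flops <= 32) ? 0 : (flops <= 512) ? 1 : 2;
    }
    return bin;
}

int main() {
    // Toy 2x2 A with rows {cols 0,1} and {col 1}; B rows have 3 and 1 nnz.
    std::vector<int> aRowPtr = {0, 2, 3}, aColIdx = {0, 1, 1};
    std::vector<int> bRowPtr = {0, 3, 4};
    std::vector<int> bin = binRowsByWork(aRowPtr, aColIdx, bRowPtr);
    for (int i = 0; i < (int)bin.size(); ++i)
        printf("row %d -> bin %d\n", i, bin[i]);
    return 0;
}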
“…We use hash table for accumulator in SpGEMM computation, based on GPU work [25]. Figure 7 shows the algorithm of Hash SpGEMM for multi-and many-core processors.…”
Section: Hash SpGEMM
confidence: 99%
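To make the quoted idea concrete: a hash-based accumulator stores the (column index, partial value) pairs produced while forming one row of C = A*B, merging products that land on the same column. Below is a minimal sequential sketch of that accumulator under assumed names (hashAccumulate, TABLE_SIZE); the actual Hash SpGEMM of [25] places the table in GPU shared memory and inserts with atomic operations, which is not shown here.

#include <cstdio>

#define TABLE_SIZE 64   // power of two, sized past the row's nnz upper bound
#define EMPTY (-1)

// Insert (col, val) into an open-addressing hash table, accumulating when
// the same column index is seen again. Linear probing; a sketch of the
// accumulator idea only, not the paper's GPU kernel.
void hashAccumulate(int* keys, double* vals, int col, double val) {
    int h = (col * 107) & (TABLE_SIZE - 1);  // cheap multiplicative hash
    while (keys[h] != EMPTY && keys[h] != col)
        h = (h + 1) & (TABLE_SIZE - 1);      // probe the next slot
    keys[h] = col;
    vals[h] += val;
}

int main() {
    int keys[TABLE_SIZE];
    double vals[TABLE_SIZE] = {0};
    for (int i = 0; i < TABLE_SIZE; ++i) keys[i] = EMPTY;

    // Intermediate products of one row of C = A*B: (column, value) pairs.
    hashAccumulate(keys, vals, 7, 1.5);
    hashAccumulate(keys, vals, 3, 2.0);
    hashAccumulate(keys, vals, 7, 0.5);  // same column as before: accumulates

    for (int i = 0; i < TABLE_SIZE; ++i)
        if (keys[i] != EMPTY)
            printf("C(row, %d) = %g\n", keys[i], vals[i]);
    return 0;
}

Compared with a dense accumulator of one entry per column of B, the table only needs space proportional to the row's nonzero count, which is where the memory saving in the paper's title comes from.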