Sparse matrix-matrix multiplication (SpGEMM) is a key kernel with applications in several domains, such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high-performance computing architectures. The performance of these algorithms depends on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance differences between these data structures. Furthermore, we develop a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two-phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.

arXiv:1801.03065v1 [cs.DC] 9 Jan 2018

These architectures have very different characteristics. For example, traditional CPUs have powerful cores with large caches, while Xeon Phi processors have many lightweight cores, and GPUs provide extensive hierarchical parallelism with very simple computational units. The algorithms in this paper aim to minimize revisiting algorithmic design for these different architectures. The code divergence in the implementation is limited to the access strategies of different data structures and to how different levels of parallelism in the algorithm are mapped to computational units. An earlier version of this paper [13] focused on SpGEMM from the perspective of performance portability. It addressed this issue with an algorithm called kkmem, which demonstrated better performance on GPUs and on the current generation of Xeon Phi processors, Knights Landing (KNL), relative to
state-of-the-art libraries. Our contributions in [13] are summarized below.

• We design two thread-scalable data structures (multilevel hashmap accumulators and a memory pool) to achieve scalability on various platforms, and a graph compression technique to speed up the symbolic factorization of SpGEMM.
• We design hierarchical, thread-scalable SpGEMM algorithms and implement them using the Kokkos programming model. Our implementation is available at https://github.com/kokkos/kokkos-kernels and also in the Trilinos framework (https://github.com/trilinos/Trilinos).
• We also present results for the practical case of matrix-structure reuse, and demonstrate its importance for application performance.

This paper extends [13] with several new algorithm design choices and additional data structures. Its contributions are summarized below.

• We present results for the selection of kernel parameters, e.g., the partitioning scheme and the data structures, with trade-offs between memory access cost and computational overhead, and provide heuristics to choose the best parameters depending on the prob...