2019
DOI: 10.1007/s10766-018-0604-8

Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication

Abstract: General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block of a number of high-level algorithms and real-world applications. In recent years, several efficient SpGEMM algorithms have been proposed for many-core processors such as GPUs. However, their implementations of sparse accumulators, the core component of SpGEMM, mostly use low-speed on-chip shared memory and global memory, while high-speed registers are seriously underutilised. In this paper, we propose three novel register-aware…
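The truncated abstract points at the core idea: keeping the sparse accumulator in registers rather than in shared or global memory. The CUDA kernel below is a minimal illustrative sketch of that general idea, not the paper's actual algorithms; REG_SLOTS, the one-row-per-kernel launch scheme, and the omitted overflow handling are all assumptions introduced here.

```cuda
#define REG_SLOTS 8   // per-thread accumulator slots (illustrative, not from the paper)

// Sketch: each thread merges its partial products of one output row into a
// small fixed-size array. Fixed bounds plus fully unrolled accesses let the
// compiler keep the array in the register file instead of local/shared memory.
__global__ void row_partials_registers(const int *a_rowptr, const int *a_colidx,
                                       const float *a_vals,
                                       const int *b_rowptr, const int *b_colidx,
                                       const float *b_vals,
                                       int row, int *out_cols, float *out_vals,
                                       int *out_len)
{
    int   cols[REG_SLOTS];
    float vals[REG_SLOTS];
    int   used = 0;

    // Each thread takes a strided share of row 'row' of A and multiplies it
    // by the matching rows of B, merging partial products in registers.
    for (int i = a_rowptr[row] + threadIdx.x; i < a_rowptr[row + 1]; i += blockDim.x) {
        int   k  = a_colidx[i];
        float av = a_vals[i];
        for (int j = b_rowptr[k]; j < b_rowptr[k + 1]; ++j) {
            int   col = b_colidx[j];
            float pv  = av * b_vals[j];
            bool merged = false;
            #pragma unroll
            for (int s = 0; s < REG_SLOTS; ++s)
                if (s < used && cols[s] == col) { vals[s] += pv; merged = true; }
            if (!merged && used < REG_SLOTS) {
                #pragma unroll
                for (int s = 0; s < REG_SLOTS; ++s)
                    if (s == used) { cols[s] = col; vals[s] = pv; }
                ++used;
            }
            // A full implementation would spill to shared or global memory
            // when the REG_SLOTS slots overflow; omitted in this sketch.
        }
    }

    // Flush the per-thread partial results; a real SpGEMM would still merge
    // duplicates produced by different threads before forming the CSR row.
    #pragma unroll
    for (int s = 0; s < REG_SLOTS; ++s)
        if (s < used) {
            int pos = atomicAdd(out_len, 1);
            out_cols[pos] = cols[s];
            out_vals[pos] = vals[s];
        }
}
```

The compile-time array bounds are the key design point: dynamic indexing into a thread-local array would force the compiler to place it in (slow) local memory, defeating the register-resident accumulation the abstract describes.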

Cited by 17 publications (14 citation statements) | References 43 publications
“…Most of these libraries include heterogeneous convolution algorithms and provide primitives that help in algorithm selection. In addition, several of the previously proposed effective optimizations for machine learning kernels on CPUs and GPUs [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70] can potentially be integrated into these libraries.…”
Section: Related Work
confidence: 99%
“…However, SpMM attains a significantly higher fraction of peak performance on local compute nodes and hence completes its local computation steps faster. For example, reported performance rates for SpMM on the NVIDIA P100 GPU range from 100 to 500 GFlops [17], yet SpGEMM achieves only 1–10 GFlops on the same GPU [25]. Consequently, SpGEMM hides its communication costs with local computation more effectively than SpMM.…”
Section: Related Work
confidence: 99%
“…For output formation, there are many options. Prior work used the expand-sort-compress strategy [15], [24], [18] or used accumulators based on a heap [22], a hash table [12], or a dense vector called SPA [20], [25]. Table I summarizes prior work based on their data access patterns.…”
Section: B. Classes of SpGEMM Algorithms Categorized by Data Access Pa...
confidence: 99%
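As a concrete illustration of one of the accumulator options the quoted passage lists, the following CUDA kernel sketches a shared-memory hash-table accumulator that merges the partial products of one row of C. It is a hedged sketch under simplifying assumptions (HASH_SIZE slots fit in shared memory and bound each row's nonzeros, a dense per-row output buffer, c_rowlen zero-initialised by the caller), not the implementation of any cited paper.

```cuda
#define HASH_SIZE 1024          // shared-memory hash slots per row (assumed)
#define EMPTY     (-1)

// One thread block accumulates one row of C = A * B (both CSR). Assumes each
// output row has at most HASH_SIZE nonzeros and c_rowlen starts at zero.
__global__ void spgemm_row_hash(const int *a_rowptr, const int *a_colidx,
                                const float *a_vals,
                                const int *b_rowptr, const int *b_colidx,
                                const float *b_vals,
                                int *c_cols, float *c_vals, int *c_rowlen)
{
    __shared__ int   keys[HASH_SIZE];
    __shared__ float vals[HASH_SIZE];

    int row = blockIdx.x;
    for (int i = threadIdx.x; i < HASH_SIZE; i += blockDim.x) {
        keys[i] = EMPTY;
        vals[i] = 0.0f;
    }
    __syncthreads();

    // Expand and accumulate: each thread takes nonzeros a(row,k) and scatters
    // a(row,k) * b(k,:) into the hash table (linear probing on the column).
    for (int i = a_rowptr[row] + threadIdx.x; i < a_rowptr[row + 1];
         i += blockDim.x) {
        int   k  = a_colidx[i];
        float av = a_vals[i];
        for (int j = b_rowptr[k]; j < b_rowptr[k + 1]; ++j) {
            int col = b_colidx[j];
            int h   = col & (HASH_SIZE - 1);
            while (true) {
                int prev = atomicCAS(&keys[h], EMPTY, col);
                if (prev == EMPTY || prev == col) {   // claimed or matched slot
                    atomicAdd(&vals[h], av * b_vals[j]);
                    break;
                }
                h = (h + 1) & (HASH_SIZE - 1);        // probe the next slot
            }
        }
    }
    __syncthreads();

    // Compress: write the surviving (column, value) pairs into a dense
    // per-row buffer; a real implementation would use prefix-summed row
    // pointers to pack the output tightly.
    for (int i = threadIdx.x; i < HASH_SIZE; i += blockDim.x)
        if (keys[i] != EMPTY) {
            int pos = atomicAdd(&c_rowlen[row], 1);
            c_cols[row * HASH_SIZE + pos] = keys[i];
            c_vals[row * HASH_SIZE + pos] = vals[i];
        }
}
```

A dense-vector (SPA) accumulator would replace the hash probe with direct indexing by column, trading the probing logic for storage proportional to the number of columns of B; a heap-based accumulator instead merges the candidate rows of B as sorted streams.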
“…After the entire Ĉ is constructed, the flop tuples in Ĉ are sorted and merged to generate the final output. Since sorting can be performed efficiently on GPUs, ESC SpGEMM can perform better than other algorithms on GPUs [15], [18]. The column ESC algorithm has access patterns for A, B, and C similar to those of the column SpGEMM algorithm.…”
Section: B. Classes of SpGEMM Algorithms Categorized by Data Access Pa...
confidence: 99%
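The expand-sort-compress flow described in the passage maps naturally onto GPU sort and reduction primitives. The sketch below uses Thrust to demonstrate it end to end on a tiny pair of CSR matrices; the flat key encoding (row * ncols + col), the host-side expansion, and the example operands are illustrative assumptions, not any cited paper's implementation.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>
#include <vector>

int main() {
    // Tiny CSR operands A (2x3) and B (3x2), chosen only to demonstrate ESC.
    std::vector<int>   a_ptr{0, 2, 3}, a_col{0, 2, 1};
    std::vector<float> a_val{1.f, 2.f, 3.f};
    std::vector<int>   b_ptr{0, 1, 2, 4}, b_col{0, 1, 0, 1};
    std::vector<float> b_val{4.f, 5.f, 6.f, 7.f};
    const int ncols_c = 2;

    // Expand: enumerate every flop tuple (row, col, a*b), keyed by a flat
    // index so that sorting orders tuples by row and then by column.
    std::vector<long long> h_keys;
    std::vector<float>     h_vals;
    for (int i = 0; i + 1 < (int)a_ptr.size(); ++i)
        for (int p = a_ptr[i]; p < a_ptr[i + 1]; ++p) {
            int k = a_col[p];
            for (int q = b_ptr[k]; q < b_ptr[k + 1]; ++q) {
                h_keys.push_back((long long)i * ncols_c + b_col[q]);
                h_vals.push_back(a_val[p] * b_val[q]);
            }
        }

    // Sort + compress on the GPU: sorting makes duplicate (row, col) tuples
    // adjacent, and reduce_by_key sums them into the final nonzeros of C.
    thrust::device_vector<long long> keys(h_keys);
    thrust::device_vector<float>     vals(h_vals);
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    thrust::device_vector<long long> out_keys(keys.size());
    thrust::device_vector<float>     out_vals(vals.size());
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                      out_keys.begin(), out_vals.begin());
    int nnz_c = ends.first - out_keys.begin();

    for (int n = 0; n < nnz_c; ++n) {
        long long key = out_keys[n];
        printf("C(%lld,%lld) = %g\n", key / ncols_c, key % ncols_c,
               (float)out_vals[n]);
    }
    return 0;
}
```

Here the sort is the expensive step and the compression is a single reduce_by_key, so the heavy lifting is delegated to library primitives that are highly optimised on GPUs, which is why the quoted passage notes that ESC SpGEMM can perform well there.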