2019
DOI: 10.1007/s10766-018-0604-8

Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication

Abstract: General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block of a number of high-level algorithms and real-world applications. In recent years, several efficient SpGEMM algorithms have been proposed for many-core processors such as GPUs. However, their implementations of sparse accumulators, the core component of SpGEMM, mostly use low-speed on-chip shared memory and global memory, while high-speed registers are seriously underutilised. In this paper, we propose three novel register-aware…
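The truncated abstract points at the core idea: keeping the sparse accumulator in registers rather than in shared or global memory. The CUDA kernel below is a minimal illustrative sketch of that general idea, not the paper's actual algorithms; REG_SLOTS, the one-row-per-kernel launch scheme, and the omitted overflow handling are all assumptions introduced here.

```cuda
#define REG_SLOTS 8   // per-thread accumulator slots (illustrative, not from the paper)

// Sketch: each thread merges its partial products of one output row into a
// small fixed-size array. Fixed bounds plus fully unrolled accesses let the
// compiler keep the array in the register file instead of local/shared memory.
__global__ void row_partials_registers(const int *a_rowptr, const int *a_colidx,
                                       const float *a_vals,
                                       const int *b_rowptr, const int *b_colidx,
                                       const float *b_vals,
                                       int row, int *out_cols, float *out_vals,
                                       int *out_len)
{
    int   cols[REG_SLOTS];
    float vals[REG_SLOTS];
    int   used = 0;

    // Each thread takes a strided share of row 'row' of A and multiplies it
    // by the matching rows of B, merging partial products in registers.
    for (int i = a_rowptr[row] + threadIdx.x; i < a_rowptr[row + 1]; i += blockDim.x) {
        int   k  = a_colidx[i];
        float av = a_vals[i];
        for (int j = b_rowptr[k]; j < b_rowptr[k + 1]; ++j) {
            int   col = b_colidx[j];
            float pv  = av * b_vals[j];
            bool merged = false;
            #pragma unroll
            for (int s = 0; s < REG_SLOTS; ++s)
                if (s < used && cols[s] == col) { vals[s] += pv; merged = true; }
            if (!merged && used < REG_SLOTS) {
                #pragma unroll
                for (int s = 0; s < REG_SLOTS; ++s)
                    if (s == used) { cols[s] = col; vals[s] = pv; }
                ++used;
            }
            // A full implementation would spill to shared or global memory
            // when the REG_SLOTS slots overflow; omitted in this sketch.
        }
    }

    // Flush the per-thread partial results; a real SpGEMM would still merge
    // duplicates produced by different threads before forming the CSR row.
    #pragma unroll
    for (int s = 0; s < REG_SLOTS; ++s)
        if (s < used) {
            int pos = atomicAdd(out_len, 1);
            out_cols[pos] = cols[s];
            out_vals[pos] = vals[s];
        }
}
```

The compile-time array bounds are the key design point: dynamic indexing into a thread-local array would force the compiler to place it in (slow) local memory, defeating the register-resident accumulation the abstract describes.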

Cited by 17 publications (14 citation statements) | References 43 publications
“…Most of these libraries include heterogeneous convolution algorithms and provide primitives that help in algorithm selection. In addition, several of the previously proposed effective optimizations for machine learning kernels on CPUs and GPUs [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68], [69], [70] can potentially be integrated into these libraries.…”
Section: Related Work
confidence: 99%
“…However, SpMM attains a significantly higher fraction of peak performance on local compute nodes and hence completes its local computation steps faster. For example, reported performance rates for SpMM on the NVIDIA P100 GPU range from 100 to 500 GFlops [17], yet SpGEMM achieves only 1–10 GFlops on the same GPU [25]. Consequently, SpGEMM hides its communication costs with local computation more effectively than SpMM.…”
Section: Related Work
confidence: 99%
“…For output formation, there are many options. Prior work used the expand-sort-compress strategy [15], [24], [18] or used accumulators based on a heap [22], a hash table [12], or a dense vector called SPA [20], [25]. Table I summarizes prior work based on their data access patterns.…”
Section: B. Classes of SpGEMM Algorithms Categorized by Data Access Pa...
confidence: 99%
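As a concrete illustration of one of the accumulator options the quoted passage lists, the following CUDA kernel sketches a shared-memory hash-table accumulator that merges the partial products of one row of C. It is a hedged sketch under simplifying assumptions (HASH_SIZE slots fit in shared memory and bound each row's nonzeros, a dense per-row output buffer, c_rowlen zero-initialised by the caller), not the implementation of any cited paper.

```cuda
#define HASH_SIZE 1024          // shared-memory hash slots per row (assumed)
#define EMPTY     (-1)

// One thread block accumulates one row of C = A * B (both CSR). Assumes each
// output row has at most HASH_SIZE nonzeros and c_rowlen starts at zero.
__global__ void spgemm_row_hash(const int *a_rowptr, const int *a_colidx,
                                const float *a_vals,
                                const int *b_rowptr, const int *b_colidx,
                                const float *b_vals,
                                int *c_cols, float *c_vals, int *c_rowlen)
{
    __shared__ int   keys[HASH_SIZE];
    __shared__ float vals[HASH_SIZE];

    int row = blockIdx.x;
    for (int i = threadIdx.x; i < HASH_SIZE; i += blockDim.x) {
        keys[i] = EMPTY;
        vals[i] = 0.0f;
    }
    __syncthreads();

    // Expand and accumulate: each thread takes nonzeros a(row,k) and scatters
    // a(row,k) * b(k,:) into the hash table (linear probing on the column).
    for (int i = a_rowptr[row] + threadIdx.x; i < a_rowptr[row + 1];
         i += blockDim.x) {
        int   k  = a_colidx[i];
        float av = a_vals[i];
        for (int j = b_rowptr[k]; j < b_rowptr[k + 1]; ++j) {
            int col = b_colidx[j];
            int h   = col & (HASH_SIZE - 1);
            while (true) {
                int prev = atomicCAS(&keys[h], EMPTY, col);
                if (prev == EMPTY || prev == col) {   // claimed or matched slot
                    atomicAdd(&vals[h], av * b_vals[j]);
                    break;
                }
                h = (h + 1) & (HASH_SIZE - 1);        // probe the next slot
            }
        }
    }
    __syncthreads();

    // Compress: write the surviving (column, value) pairs into a dense
    // per-row buffer; a real implementation would use prefix-summed row
    // pointers to pack the output tightly.
    for (int i = threadIdx.x; i < HASH_SIZE; i += blockDim.x)
        if (keys[i] != EMPTY) {
            int pos = atomicAdd(&c_rowlen[row], 1);
            c_cols[row * HASH_SIZE + pos] = keys[i];
            c_vals[row * HASH_SIZE + pos] = vals[i];
        }
}
```

A dense-vector (SPA) accumulator would replace the hash probe with direct indexing by column, trading the probing logic for storage proportional to the number of columns of B; a heap-based accumulator instead merges the candidate rows of B as sorted streams.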
“…After the entire Ĉ is constructed, the flop tuples in Ĉ are sorted and merged to generate the final output. Since sorting can be performed efficiently on GPUs, ESC SpGEMM can perform better than other algorithms on GPUs [15], [18]. The column ESC algorithm has access patterns for A, B, and C similar to those of the column SpGEMM algorithm.…”
Section: B. Classes of SpGEMM Algorithms Categorized by Data Access Pa...
confidence: 99%
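The expand-sort-compress flow described in the passage maps naturally onto GPU sort and reduction primitives. The sketch below uses Thrust to demonstrate it end to end on a tiny pair of CSR matrices; the flat key encoding (row * ncols + col), the host-side expansion, and the example operands are illustrative assumptions, not any cited paper's implementation.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>
#include <vector>

int main() {
    // Tiny CSR operands A (2x3) and B (3x2), chosen only to demonstrate ESC.
    std::vector<int>   a_ptr{0, 2, 3}, a_col{0, 2, 1};
    std::vector<float> a_val{1.f, 2.f, 3.f};
    std::vector<int>   b_ptr{0, 1, 2, 4}, b_col{0, 1, 0, 1};
    std::vector<float> b_val{4.f, 5.f, 6.f, 7.f};
    const int ncols_c = 2;

    // Expand: enumerate every flop tuple (row, col, a*b), keyed by a flat
    // index so that sorting orders tuples by row and then by column.
    std::vector<long long> h_keys;
    std::vector<float>     h_vals;
    for (int i = 0; i + 1 < (int)a_ptr.size(); ++i)
        for (int p = a_ptr[i]; p < a_ptr[i + 1]; ++p) {
            int k = a_col[p];
            for (int q = b_ptr[k]; q < b_ptr[k + 1]; ++q) {
                h_keys.push_back((long long)i * ncols_c + b_col[q]);
                h_vals.push_back(a_val[p] * b_val[q]);
            }
        }

    // Sort + compress on the GPU: sorting makes duplicate (row, col) tuples
    // adjacent, and reduce_by_key sums them into the final nonzeros of C.
    thrust::device_vector<long long> keys(h_keys);
    thrust::device_vector<float>     vals(h_vals);
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    thrust::device_vector<long long> out_keys(keys.size());
    thrust::device_vector<float>     out_vals(vals.size());
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                      out_keys.begin(), out_vals.begin());
    int nnz_c = ends.first - out_keys.begin();

    for (int n = 0; n < nnz_c; ++n) {
        long long key = out_keys[n];
        printf("C(%lld,%lld) = %g\n", key / ncols_c, key % ncols_c,
               (float)out_vals[n]);
    }
    return 0;
}
```

Here the sort is the expensive step and the compression is a single reduce_by_key, so the heavy lifting is delegated to library primitives that are highly optimised on GPUs, which is why the quoted passage notes that ESC SpGEMM can perform well there.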