Fast sparse matrix-vector multiplication by exploiting variable block structure

Vuduc, Richard; Moon, Hyun-Jin

doi:10.2172/891708

Cited by 51 publications

(61 citation statements)

References 21 publications


“…At the low-level, future work will investigate rectangular register blocks as mentioned in Section V, and using variable sized blocks via splitting [31]. A practical approach to address variable blocking that exploits the recursive structure of CSB is as follows.…”

Section: Discussion (mentioning)

confidence: 99%

Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication

Buluç

Williams

Oliker

et al. 2011

2011 IEEE International Parallel & Distributed Processing Symposium

109


Abstract: On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing lots of parallelism in practice. We also give a new data structure transformation, called bitmasked register blocks, which promises significant reductions on bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefits of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratio and larger sparse matrices) continue.
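To make the bitmasked register block idea in the abstract above concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a hypothetical layout in which each r × c block keeps one bitmask of occupied positions plus only its nonzero values, so no fill-in zeros are stored. The function names, the tuple layout, and the 2 × 2 block size are illustrative assumptions.

```python
def to_bitmasked_blocks(triples, r, c):
    """Group (row, col, value) triples into r x c blocks.

    Hypothetical layout: each block is (block_row, block_col, mask, values),
    where bit (i % r) * c + (j % c) of mask is set for every stored entry
    and values lists only the nonzeros, in increasing bit order.
    """
    cells = {}
    for i, j, v in triples:
        cells.setdefault((i // r, j // c), {})[(i % r) * c + (j % c)] = v
    blocks = []
    for (bi, bj), entries in sorted(cells.items()):
        mask = 0
        values = []
        for bit in sorted(entries):
            mask |= 1 << bit
            values.append(entries[bit])
        blocks.append((bi, bj, mask, values))
    return blocks


def spmv_bitmasked(blocks, r, c, x, n_rows):
    """Compute y = A @ x, walking the bitmask to locate each stored value."""
    y = [0.0] * n_rows
    for bi, bj, mask, values in blocks:
        k = 0  # position of the next stored value inside this block
        for bit in range(r * c):
            if (mask >> bit) & 1:
                y[bi * r + bit // c] += values[k] * x[bj * c + bit % c]
                k += 1
    return y


# 4 x 4 example with five nonzeros (illustrative data only).
triples = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0), (2, 3, 4.0), (3, 0, 5.0)]
blocks = to_bitmasked_blocks(triples, 2, 2)
y = spmv_bitmasked(blocks, 2, 2, [1.0, 2.0, 3.0, 4.0], 4)
```

In this toy example the five nonzeros are covered by three block coordinates and three masks instead of five separate column indices, and exactly five values are stored.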



“…For instance, matrices with banded structure or where nonzeros are grouped in (almost) dense blocks occur often in practice. This insight can be used to create more optimised block-based storage formats [13], where only the position of nonzero blocks needs to be stored. This reduces the amount of metadata to store and increases the computational efficiency due to the dense local structure.…”

Section: Background and Related Work (mentioning)

confidence: 99%

Optimising Sparse Matrix Vector multiplication for large scale FEM problems on FPGA

Grigoras

Burovskiy

Luk

et al. 2016

2016 26th International Conference on Field Programmable Logic and Applications (FPL)


Abstract: Sparse Matrix Vector multiplication (SpMV) is an important kernel in many scientific applications. In this work we propose an architecture and an automated customisation method to detect and optimise the architecture for block diagonal sparse matrices. We evaluate the proposed approach in the context of the spectral/hp Finite Element Method, using the local matrix assembly approach. This problem leads to a large sparse system of linear equations with block diagonal matrix which is typically solved using an iterative method such as the Preconditioned Conjugate Gradient. The efficiency of the proposed architecture combined with the effectiveness of the proposed customisation method reduces BRAM resource utilisation by as much as 10 times, while achieving identical throughput with existing state of the art designs and requiring minimal development effort from the end user. In the context of the Finite Element Method, our approach enables the solution of larger problems than previously possible, enabling the applicability of FPGAs to more interesting HPC problems.
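The citation statement for this entry says that block-based formats store only the positions of nonzero blocks, which reduces metadata. The following rough Python sketch illustrates that accounting. It assumes standard CSR (one column index per nonzero plus a row-pointer array) and an aligned r × c blocked variant (one column index per block plus a block-row-pointer array). The function name and the counting in "index words" are illustrative assumptions, not taken from any of the cited papers.

```python
def index_overhead(triples, n_rows, r, c):
    """Return (csr_index_words, blocked_index_words, blocked_stored_values).

    triples: iterable of (row, col, value) nonzeros.
    Blocks are aligned at multiples of r rows and c columns.
    """
    triples = list(triples)
    nnz = len(triples)
    # CSR: one column index per nonzero, plus n_rows + 1 row pointers.
    csr_index_words = nnz + (n_rows + 1)
    # Blocked: one column index per nonzero block, plus block-row pointers.
    block_coords = {(i // r, j // c) for i, j, _ in triples}
    n_block_rows = -(-n_rows // r)  # ceiling division
    blocked_index_words = len(block_coords) + (n_block_rows + 1)
    # Every block is stored densely, so explicit zeros pad partly filled blocks.
    blocked_stored_values = len(block_coords) * r * c
    return csr_index_words, blocked_index_words, blocked_stored_values


# Block-diagonal 4 x 4 matrix made of two dense 2 x 2 blocks (toy data).
dense_blocks = [(i, j, 1.0) for b in (0, 2) for i in (b, b + 1) for j in (b, b + 1)]
print(index_overhead(dense_blocks, 4, 2, 2))  # prints (13, 5, 8)
```

For this fully dense block-diagonal case the blocked layout needs 5 index words instead of 13 and adds no padding. With sparser blocks the third number grows beyond the nonzero count, which is the trade-off between metadata and fill-in.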


“…BCSR blocks are row- and column-aligned at r and c elements boundaries, respectively. Although this alignment may seem restrictive and, generally, lead to more padding [14], it can greatly favor vectorization as it will be explained in the following. Figure 2 shows the SpMV kernel for BCSR with 2 × 2 blocks.…”

Section: Storage Formats For Sparse Matrices (mentioning)

confidence: 99%

“…Consequently, proper alignment of data should be considered as a prerequisite for performance when trying to vectorize SpMV. For this reason, BCSR compared to Unaligned BCSR (UBCSR) [14] is a more appropriate data structure for vectorization, since the logically aligned blocks of BCSR can be easily aligned in memory without any extra padding. Another not so obvious implication of the alignment requirements is that blocks not having at least one even dimension, such as the 3×1 and 3×3 blocks, cannot be efficiently vectorized, since they cannot be naturally aligned without effectively collapsing to larger blocks.…”

Section: Architectural Implications On the Execution Of Blocked And V… (mentioning)

confidence: 99%

Exploring the effect of block shapes on the performance of sparse kernels

Karakasis

Goumas

Koziris

2009

2009 IEEE International Symposium on Parallel & Distributed Processing
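The first statement for this entry refers to an SpMV kernel for BCSR with 2 × 2 blocks; the figure it cites is not reproduced here. As a stand-in, the following Python sketch shows what such a kernel typically looks like. It assumes a conventional BCSR layout with three arrays whose names (brow_ptr, bcol_ind, bval) are illustrative, with each block stored row-major and padded with explicit zeros. It is not the code from the cited paper.

```python
def bcsr_spmv_2x2(brow_ptr, bcol_ind, bval, x):
    """Compute y = A @ x for A stored in BCSR with aligned 2 x 2 blocks.

    brow_ptr[bi] .. brow_ptr[bi + 1] indexes the blocks of block row bi,
    bcol_ind[k] is the block column of block k,
    bval[4 * k : 4 * k + 4] holds block k in row-major order.
    """
    n_block_rows = len(brow_ptr) - 1
    y = [0.0] * (2 * n_block_rows)
    for bi in range(n_block_rows):
        # Two accumulators, one per row of the block row.
        y0 = 0.0
        y1 = 0.0
        for k in range(brow_ptr[bi], brow_ptr[bi + 1]):
            j = 2 * bcol_ind[k]  # first column covered by this block
            a = bval[4 * k: 4 * k + 4]
            y0 += a[0] * x[j] + a[1] * x[j + 1]
            y1 += a[2] * x[j] + a[3] * x[j + 1]
        y[2 * bi] = y0
        y[2 * bi + 1] = y1
    return y


# 4 x 4 toy matrix with five nonzeros, stored as three padded 2 x 2 blocks:
# [[1, 2, 0, 0],
#  [0, 3, 0, 0],
#  [0, 0, 0, 4],
#  [5, 0, 0, 0]]
brow_ptr = [0, 1, 3]
bcol_ind = [0, 0, 1]
bval = [1.0, 2.0, 0.0, 3.0,
        0.0, 0.0, 5.0, 0.0,
        0.0, 4.0, 0.0, 0.0]
y = bcsr_spmv_2x2(brow_ptr, bcol_ind, bval, [1.0, 2.0, 3.0, 4.0])
```

The example stores 12 values for 5 nonzeros, which shows the padding cost the statement mentions. In exchange, the inner loop has a fixed shape with no per-element column lookups, the property that makes aligned blocks attractive for vectorization.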
