Pattern-based sparse matrix representation for memory-efficient SMVM kernels

Belgin, Mehmet; Back, Godmar; Ribbens, Calvin J.

doi:10.1145/1542275.1542294

Cited by 54 publications

(53 citation statements)

References 36 publications


“…Methodologies for exploring symmetry in serial are also examined [17]. A recent study [2] utilized pattern-based accelerated SPMV to reduce indexing overhead by representing repeated sparsity patterns in the matrix with a single index, using specialized kernels to perform the operations. This method avoids filling in zeros by using bit vectors to concisely represent frequently recurring block patterns.…”

Section: Related Work (mentioning)

confidence: 99%

Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication

Buluç

Williams

Oliker

et al. 2011

2011 IEEE International Parallel & Distributed Processing Symposium



Abstract: On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth-limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing lots of parallelism in practice. We also give a new data structure transformation, called bitmasked register blocks, which promises significant reductions in bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefits of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratio and larger sparse matrices) continue.
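The idea of storing a per-block bitmask instead of fill-in zeros can be illustrated with a small sketch. This is not the authors' data structure: the 2x2 block size, the dictionary layout, and the names `to_bitmask_blocks` and `spmv_bitmask` are assumptions chosen for illustration only. Each stored block keeps one bitmask marking which positions are occupied, plus only the nonzero values, so a block costs a single index no matter how many nonzeros it holds.

```python
from typing import Dict, List, Tuple

R = C = 2  # block dimensions (an assumption chosen for illustration)

Block = Tuple[int, List[float]]  # (bitmask, packed nonzero values)


def to_bitmask_blocks(dense: List[List[float]]) -> Dict[Tuple[int, int], Block]:
    """Pack a dense matrix into R x C blocks holding a bitmask and nonzeros only."""
    blocks: Dict[Tuple[int, int], Block] = {}
    n_rows, n_cols = len(dense), len(dense[0])
    for bi in range(0, n_rows, R):
        for bj in range(0, n_cols, C):
            mask, vals = 0, []
            for di in range(R):
                for dj in range(C):
                    i, j = bi + di, bj + dj
                    if i < n_rows and j < n_cols and dense[i][j] != 0.0:
                        mask |= 1 << (di * C + dj)  # bit index encodes (di, dj)
                        vals.append(dense[i][j])
            if mask:  # completely empty blocks are not stored
                blocks[(bi, bj)] = (mask, vals)
    return blocks


def spmv_bitmask(blocks: Dict[Tuple[int, int], Block],
                 x: List[float], n_rows: int) -> List[float]:
    """Compute y = A @ x by walking the set bits of each block's mask."""
    y = [0.0] * n_rows
    for (bi, bj), (mask, vals) in blocks.items():
        k = 0  # position in the packed value list
        for bit in range(R * C):
            if (mask >> bit) & 1:
                di, dj = divmod(bit, C)
                y[bi + di] += vals[k] * x[bj + dj]
                k += 1
    return y


A = [[1.0, 0.0, 0.0, 2.0],
     [0.0, 3.0, 0.0, 0.0],
     [0.0, 0.0, 4.0, 0.0],
     [5.0, 0.0, 0.0, 6.0]]
blocks = to_bitmask_blocks(A)
print(spmv_bitmask(blocks, [1.0, 2.0, 3.0, 4.0], 4))  # [9.0, 6.0, 12.0, 29.0]
```

In this toy matrix the four stored blocks hold exactly the six nonzeros, whereas a dense 2x2 register-blocked layout would store sixteen values. A production kernel would presumably decode the mask with specialized or unrolled code rather than a bit loop.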


“…• Belgin et al ([11]) propose pattern based representations (PBR) targeted at matrices exhibiting noncontiguous nonzero patterns. Provided with apt matrices, RSB would probably benefit from such an approach while still retaining its cache blocking properties.…”

Section: Future Directions (mentioning)

confidence: 99%

“…Provided with apt matrices, RSB would probably benefit from such an approach while still retaining its cache blocking properties. However, an efficient implementation of PBR (according to its authors in [11]) should rely on machine specific intrinsics, and as such is of limited portability.…”

Section: Future Directions (mentioning)

confidence: 99%

“…However, unlike CSB (Buluç et al [8]) the sparse blocks dimensions are not uniform, and unlike Yzelman and Bisseling's ([9]) our techniques are not hyper-graph based. Similarly to other approaches, selection of a data structure for blocks occurs, but without using completely novel formats, as Kourtis et al [10] do with CSX or as Belgin et al [11] do with PBR. Unlike approaches combining dense blocking and autotuning techniques (like BCSR in SPARSITY, by Im et al [12]) RSB does not require the representation of excess zeroes, but still has a potential for autotuning.…”

Section: Introduction and Related Literature (mentioning)

confidence: 99%


Efficient multithreaded untransposed, transposed or symmetric sparse matrix–vector multiplication with the Recursive Sparse Blocks format

Martone

2014

Parallel Computing


In earlier work we introduced the "Recursive Sparse Blocks" (RSB) sparse matrix storage scheme, oriented towards cache-efficient matrix-vector multiplication (SpMV) and triangular solution (SpSV) on cache-based shared-memory parallel computers. Both the transposed (SpMV T) and symmetric (SymSpMV) matrix-vector multiply variants are supported. RSB is a meta-format: it recursively partitions a rectangular sparse matrix into quadrants; leaf submatrices are stored in an appropriate traditional format, either Compressed Sparse Rows (CSR) or Coordinate (COO). In this work, we compare the performance of our RSB implementation of SpMV, SpMV T, and SymSpMV to that of the state-of-the-art Intel Math Kernel Library (MKL) CSR implementation on Intel's recent Sandy Bridge processor. Our results with a few dozen large real-world matrices suggest the efficiency of the approach: in all cases, RSB's SymSpMV (and in most cases, SpMV T as well) took less than half of MKL CSR's time; SpMV's advantage was smaller. Furthermore, RSB's SpMV T is more scalable than MKL's CSR, in that it performs almost as well as SpMV. Additionally, we include comparisons to the state-of-the-art Compressed Sparse Blocks (CSB) format implementation. We observed RSB to be slightly superior to CSB in SpMV T, slightly inferior in SpMV, and better (in most cases by a factor of two or more) in SymSpMV. Although RSB is a non-traditional storage format and thus needs a special constructor, it can be assembled from CSR or any other similar row-ordered representation arrays in the time of a few dozen matrix-vector multiply executions. Thanks to its significant advantage over MKL's CSR routines for symmetric or transposed matrix-vector multiplication, in most of the observed cases the assembly cost amortized within fewer than fifty iterations.
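The recursive quadrant partitioning described in this abstract can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the RSB library's code: the stopping rule (a hypothetical `leaf_nnz` threshold on nonzeros per leaf) stands in for whatever cache-oriented criterion the real implementation uses, every leaf is kept in COO form, and the names `partition` and `spmv_leaves` are invented here.

```python
from typing import List, Tuple

Entry = Tuple[int, int, float]  # (row, col, value) in COO form


def partition(entries: List[Entry], r0: int, r1: int, c0: int, c1: int,
              leaf_nnz: int = 2) -> List[List[Entry]]:
    """Split the submatrix [r0, r1) x [c0, c1) into quadrants, recursively.

    Returns the non-empty leaves, each a list of COO entries.
    """
    if not entries:
        return []
    # Stop when the leaf is small enough or can no longer be split.
    if len(entries) <= leaf_nnz or (r1 - r0 <= 1 and c1 - c0 <= 1):
        return [entries]
    rm = (r0 + r1 + 1) // 2  # row midpoint
    cm = (c0 + c1 + 1) // 2  # column midpoint
    leaves: List[List[Entry]] = []
    for ra, rb in ((r0, rm), (rm, r1)):
        for ca, cb in ((c0, cm), (cm, c1)):
            sub = [e for e in entries if ra <= e[0] < rb and ca <= e[1] < cb]
            leaves.extend(partition(sub, ra, rb, ca, cb, leaf_nnz))
    return leaves


def spmv_leaves(leaves: List[List[Entry]], x: List[float],
                n_rows: int) -> List[float]:
    """Compute y = A @ x one leaf at a time."""
    y = [0.0] * n_rows
    for leaf in leaves:
        for i, j, v in leaf:
            y[i] += v * x[j]
    return y


coo = [(0, 0, 1.0), (0, 3, 2.0), (1, 1, 3.0),
       (2, 2, 4.0), (3, 0, 5.0), (3, 3, 6.0)]
leaves = partition(coo, 0, 4, 0, 4)
print(len(leaves))                                   # 4
print(spmv_leaves(leaves, [1.0, 2.0, 3.0, 4.0], 4))  # [9.0, 6.0, 12.0, 29.0]
```

Processing one leaf at a time confines the accessed parts of `x` and `y` to the leaf's column and row ranges, which is the locality property a cache-oriented blocked format aims for.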


“…The experiments were conducted on two Intel Clovertown with 4MB of L2 cache each. In the same direction, Belgin et al [3] proposed a pattern-based blocking scheme for reducing the index overhead. Accompanied by software prefetching and vectorization techniques, they attained an average sequential speedup of 1.4.…”

Section: Related Work (mentioning)

confidence: 99%

Parallel Structurally-Symmetric Sparse Matrix-Vector Products on Multi-Core Processors

Batista

Ainsworth

Ribeiro

Proceedings of the Third International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering


We consider the problem of developing an efficient multi-threaded implementation of the matrix-vector multiplication algorithm for sparse matrices with structural symmetry. Matrices are stored using the compressed sparse row-column format (CSRC), designed for profiting from the symmetric non-zero pattern observed in global finite element matrices. Unlike classical compressed storage formats, performing the sparse matrix-vector product using the CSRC requires thread-safe access to the destination vector. To avoid race conditions, we have implemented two partitioning strategies. In the first one, each thread allocates an array for storing its contributions, which are later combined in an accumulation step. We analyze how to perform this accumulation in four different ways. The second strategy employs a coloring algorithm for grouping rows that can be concurrently processed by threads. Our results indicate that, although incurring an increase in the working set size, the former approach leads to the best performance improvements for most matrices.
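The first strategy in this abstract (per-thread contribution arrays followed by an accumulation step) can be sketched as follows. The storage layout is an assumption and is much simpler than the actual CSRC format: the diagonal is kept in `diag`, and each structurally symmetric off-diagonal pair is a hypothetical tuple `(i, j, a_ij, a_ji)` with `j < i`. The function name is also invented. Because each pair updates both `y[i]` and `y[j]`, writing directly into a shared `y` would race, so each task fills its own buffer.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Pair = Tuple[int, int, float, float]  # (i, j, a_ij, a_ji) with j < i


def spmv_private_buffers(diag: List[float], lower: List[Pair],
                         x: List[float], n_threads: int = 2) -> List[float]:
    """y = A @ x for a structurally symmetric A, using per-task buffers."""
    n = len(diag)
    # Round-robin split of the off-diagonal pairs across tasks.
    chunks = [lower[t::n_threads] for t in range(n_threads)]

    def work(chunk: List[Pair]) -> List[float]:
        buf = [0.0] * n  # private to this task, so no write conflicts
        for i, j, a_ij, a_ji in chunk:
            buf[i] += a_ij * x[j]  # contribution of the lower entry
            buf[j] += a_ji * x[i]  # contribution of its mirrored entry
        return buf

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        buffers = list(pool.map(work, chunks))

    # Accumulation step: diagonal part plus every private buffer.
    y = [d * xi for d, xi in zip(diag, x)]
    for buf in buffers:
        for k in range(n):
            y[k] += buf[k]
    return y


diag = [1.0, 3.0, 4.0, 6.0]
lower = [(3, 0, 5.0, 2.0), (2, 1, 7.0, 8.0)]
print(spmv_private_buffers(diag, lower, [1.0, 2.0, 3.0, 4.0]))
# [9.0, 30.0, 26.0, 29.0]
```

The sketch shows the structure of the approach only; CPython threads will not give a real speedup here. The extra buffers are the working-set increase the abstract mentions, and the final loop is one of several possible ways to perform the accumulation.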
