Parallel Computing 2000
DOI: 10.1142/9781848160170_0036

Towards a Fast Parallel Sparse Matrix-Vector Multiplication

Abstract: The sparse matrix-vector product is an important computational kernel that runs inefficiently on many computers with super-scalar RISC processors. In this paper we analyse the performance of the sparse matrix-vector product with symmetric matrices originating from the FEM and describe techniques that lead to a fast implementation. It is shown how these optimisations can be incorporated into an efficient parallel implementation using message-passing. We conduct numerical experiments on many different machines an…
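
For context on the symmetric storage the abstract refers to: with only the lower triangle of the matrix stored, each off-diagonal entry can be applied twice per pass, roughly halving the memory traffic for matrix data. The following is a minimal sketch in plain C, assuming a CSR layout of the lower triangle; the names and layout are illustrative, not the authors' implementation.

```c
#include <stddef.h>

/* Symmetric SpMV sketch: only the lower triangle (including the
 * diagonal) is stored in CSR format.  Each strictly-lower entry a_ij
 * contributes to both y_i and y_j, so the matrix data is read once
 * but used twice.  (Illustrative code, not the paper's kernel.) */
void spmv_sym_csr(size_t n,
                  const size_t *row_ptr,   /* n+1 entries             */
                  const size_t *col_idx,   /* column of each nonzero  */
                  const double *val,       /* nonzero values          */
                  const double *x,
                  double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = 0.0;

    for (size_t i = 0; i < n; ++i) {
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            size_t j = col_idx[k];
            y[i] += val[k] * x[j];
            if (j != i)              /* mirror the strictly lower part */
                y[j] += val[k] * x[i];
        }
    }
}
```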

Cited by 22 publications (26 citation statements); references 5 publications.

“…A number of works consider techniques that compress the data structure by recognizing patterns in order to eliminate the integer index overhead. These patterns include blocks [10], variable or mixtures of differently-sized blocks [6], diagonals, which may be especially well-suited to machines with SIMD and vector units [19], dense subtriangles arising in sparse triangular solve [22], symmetry [11], and combinations thereof.…”
Section: OSKI, OSKI-PETSc, and Related Work
confidence: 99%
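
The index-compression idea in the quote above can be made concrete with fixed-size register blocking (BCSR), where one column index is stored per r×c block instead of per nonzero. A minimal 2×2 sketch, with hypothetical names and layout, might look like this:

```c
#include <stddef.h>

/* BCSR SpMV sketch with fixed 2x2 blocks: one column index per block
 * rather than one per nonzero, which is how blocking cuts integer
 * index overhead.  Assumes the dimension is even and that explicit
 * zeros pad partially filled blocks.  (Hypothetical layout.) */
void spmv_bcsr_2x2(size_t n_block_rows,
                   const size_t *brow_ptr,  /* n_block_rows+1 entries      */
                   const size_t *bcol_idx,  /* block column per block      */
                   const double *bval,      /* 4 values/block, row-major   */
                   const double *x,
                   double *y)
{
    for (size_t ib = 0; ib < n_block_rows; ++ib) {
        double y0 = 0.0, y1 = 0.0;
        for (size_t k = brow_ptr[ib]; k < brow_ptr[ib + 1]; ++k) {
            const double *b  = &bval[4 * k];
            const double *xp = &x[2 * bcol_idx[k]];
            y0 += b[0] * xp[0] + b[1] * xp[1];
            y1 += b[2] * xp[0] + b[3] * xp[1];
        }
        y[2 * ib]     = y0;   /* overwrite, not accumulate */
        y[2 * ib + 1] = y1;
    }
}
```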
“…Better low-level tuning of the kind proposed in this paper, even applied to just a CSR SpMV, is also possible. Recent work on low-level tuning of SpMV by unroll-and-jam [12], software pipelining [6], and prefetching [17] influences our work. See [19] for an extensive overview of SpMV optimization techniques.…”
Section: OSKI, OSKI-PETSc, and Related Work
confidence: 99%
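
As a concrete instance of the low-level tuning the quote mentions, the inner CSR loop can be unrolled so the processor has several independent partial sums to schedule. This is a generic sketch of the technique, not code from any of the cited papers; compilers may perform the same transformation automatically, so measurement is essential.

```c
#include <stddef.h>

/* CSR SpMV with the inner loop unrolled by four.  Four independent
 * accumulators break the dependence chain on a single sum, giving a
 * super-scalar core more instruction-level parallelism. */
void spmv_csr_unroll4(size_t n,
                      const size_t *row_ptr,
                      const size_t *col_idx,
                      const double *val,
                      const double *x,
                      double *y)
{
    for (size_t i = 0; i < n; ++i) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t k = row_ptr[i], end = row_ptr[i + 1];
        for (; k + 4 <= end; k += 4) {
            s0 += val[k]     * x[col_idx[k]];
            s1 += val[k + 1] * x[col_idx[k + 1]];
            s2 += val[k + 2] * x[col_idx[k + 2]];
            s3 += val[k + 3] * x[col_idx[k + 3]];
        }
        for (; k < end; ++k)       /* remainder of the row */
            s0 += val[k] * x[col_idx[k]];
        y[i] = (s0 + s1) + (s2 + s3);
    }
}
```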
“…These patterns include blocks [13], variable or mixtures of differently-sized blocks [12], diagonals, which may be especially well-suited to machines with SIMD and vector units [32,28], general pattern compression [33], value compression [15], and combinations thereof.…”
Section: Related Work
confidence: 99%
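
The simplest form of the index compression alluded to here is narrowing the index type itself: when the column dimension fits in 16 bits, indices can be stored as uint16_t, halving index bandwidth relative to 32-bit indices. This sketch only illustrates that baseline idea; the pattern- and value-compression schemes cited in the quote are considerably more elaborate.

```c
#include <stddef.h>
#include <stdint.h>

/* CSR SpMV with 16-bit column indices.  Valid only when the matrix
 * has at most 65536 columns; the narrower index stream reduces the
 * memory traffic that usually dominates SpMV. */
void spmv_csr_idx16(size_t n,
                    const size_t   *row_ptr,
                    const uint16_t *col_idx,   /* assumes <= 65536 columns */
                    const double   *val,
                    const double   *x,
                    double         *y)
{
    for (size_t i = 0; i < n; ++i) {
        double s = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            s += val[k] * x[col_idx[k]];
        y[i] = s;
    }
}
```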
“…Researchers have also examined low-level tuning of SpMV by unroll-and-jam [20], software pipelining [12], and prefetching [26]. A completely recursive layout for SpMV, motivated by CSB, has recently been examined by Martone et al. [19].…”
Section: Related Work
confidence: 99%
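
For the prefetching variant named in the quote, one common shape is to issue an explicit prefetch for matrix data a fixed distance ahead of the current nonzero. The sketch below uses __builtin_prefetch, a GCC/Clang extension; the distance PF_DIST is a tuning knob assumed here for illustration, not a universally good value.

```c
#include <stddef.h>

#define PF_DIST 16   /* prefetch distance, in nonzeros; machine-dependent */

/* CSR SpMV with software prefetch of the value stream and of the
 * irregularly accessed x entries PF_DIST nonzeros ahead. */
void spmv_csr_prefetch(size_t n,
                       const size_t *row_ptr,
                       const size_t *col_idx,
                       const double *val,
                       const double *x,
                       double *y)
{
    size_t nnz = row_ptr[n];
    for (size_t i = 0; i < n; ++i) {
        double s = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            if (k + PF_DIST < nnz) {
                __builtin_prefetch(&val[k + PF_DIST], 0, 0);
                __builtin_prefetch(&x[col_idx[k + PF_DIST]], 0, 0);
            }
            s += val[k] * x[col_idx[k]];
        }
        y[i] = s;
    }
}
```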
“…The inspiration for this study comes from recent work on splitting by Geus and Röllin [11], Pinar and Heath [25], and Toledo [33], and the performance gap we have observed informally [15,37,14]. Geus and Röllin explore up to 3-way splittings for a particular application matrix used in accelerator cavity design, but the splitting terms are still based on row-aligned BCSR format.…”
Section: Related Work
confidence: 99%
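
The splitting idea credited here to Geus and Röllin, Pinar and Heath, and Toledo stores A as a sum of terms, each in a format suited to its part of the nonzero pattern, and computes y = A·x as accumulating passes over the terms. The sketch below uses a 2×2 BCSR term plus a CSR remainder; that concrete choice is illustrative, not any cited paper's scheme.

```c
#include <stddef.h>

/* Splitting sketch: A = A1 + A2, where A1 holds nonzeros that fit
 * 2x2 blocks (BCSR) and A2 holds the leftover entries (CSR).
 * y = A*x is two passes that accumulate into y. */
void spmv_split(size_t n,                 /* dimension, assumed even */
                /* A1: 2x2 BCSR part */
                const size_t *brow_ptr, const size_t *bcol_idx,
                const double *bval,
                /* A2: CSR remainder */
                const size_t *row_ptr, const size_t *col_idx,
                const double *val,
                const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = 0.0;

    /* Pass 1: blocked term (low index overhead, register reuse). */
    for (size_t ib = 0; ib < n / 2; ++ib)
        for (size_t k = brow_ptr[ib]; k < brow_ptr[ib + 1]; ++k) {
            const double *b  = &bval[4 * k];
            const double *xp = &x[2 * bcol_idx[k]];
            y[2 * ib]     += b[0] * xp[0] + b[1] * xp[1];
            y[2 * ib + 1] += b[2] * xp[0] + b[3] * xp[1];
        }

    /* Pass 2: unblocked remainder in plain CSR. */
    for (size_t i = 0; i < n; ++i)
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            y[i] += val[k] * x[col_idx[k]];
}
```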