2016
DOI: 10.1109/tpds.2015.2453970

Locality-Aware Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Processors

Abstract: Sparse matrix-vector and matrix-transpose-vector multiplication (SpMMᵀV) repeatedly performed as z ← Aᵀx and y ← Az (or y ← Aw) for the same sparse matrix A is a kernel operation widely used in various iterative solvers. One important optimization for serial SpMMᵀV is reusing A-matrix nonzeros, which halves the memory bandwidth requirement. However, thread-level parallelization of SpMMᵀV that reuses A-matrix nonzeros necessitates concurrent writes to the same output-vector entries. These concurrent writes …
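As a concrete illustration of the nonzero-reuse optimization mentioned in the abstract, the following is a minimal serial sketch of a fused SpMMᵀV kernel over a CSR matrix, computing z ← Aᵀx and y ← Aw in a single pass so that each nonzero a_ij is loaded only once. This is an illustrative sketch only, not the paper's locality-aware parallel method; the function and parameter names are assumptions.

```c
#include <stddef.h>

/* Fused serial SpMMTV on a CSR matrix A (nrows x ncols):
 * computes z = A^T x and y = A w in one sweep over the nonzeros,
 * so each a_ij is read from memory once and used for both outputs. */
void spmmtv_csr(size_t nrows, size_t ncols,
                const size_t *rowptr,   /* length nrows + 1          */
                const size_t *colind,   /* length rowptr[nrows]      */
                const double *val,      /* length rowptr[nrows]      */
                const double *x,        /* length nrows, input to A^T x */
                const double *w,        /* length ncols, input to A w   */
                double *z,              /* length ncols, output z = A^T x */
                double *y)              /* length nrows, output y = A w   */
{
    for (size_t j = 0; j < ncols; ++j)
        z[j] = 0.0;

    for (size_t i = 0; i < nrows; ++i) {
        double yi = 0.0;
        double xi = x[i];
        for (size_t k = rowptr[i]; k < rowptr[i + 1]; ++k) {
            size_t j = colind[k];
            double a = val[k];   /* a_ij is loaded once ...          */
            z[j] += a * xi;      /* ... used for z = A^T x           */
            yi   += a * w[j];    /* ... and reused for y = A w       */
        }
        y[i] = yi;
    }
}
```

If the outer loop over rows were parallelized naively across threads, the updates z[j] += a * xi issued from different rows could target the same entry of z, which is exactly the concurrent-write problem the abstract refers to.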

Cited by 18 publications (7 citation statements) | References 29 publications
“…This pair of operations cannot be calculated simultaneously because they are data dependent, whereas SpMMᵀV consists of two independent operations. The paper 13 acknowledges that the SpMMᵀV operations z ← Aᵀx and y ← Aw are used in certain algorithms but does not investigate SpMMᵀV in detail.…”
Section: Related Work
confidence: 99%
“…RCM is used in [31] for bandwidth reduction of sparse matrix A on the Xeon Phi coprocessor. For sparse matrix-vector and matrix-transpose-vector multiplication (SpMMᵀV), which contains two consecutive SpMVs, Karsavuran et al. [32] utilize hypergraph models for exploiting temporal locality on Xeon Phi.…”
Section: Related Work
confidence: 99%
“…However, the experiments by Beamer et al. have demonstrated that cache blocking is not effective for large scale-free graphs [3]. Others have proposed vertex reordering techniques based on hypergraph partitioning to improve temporal and spatial locality [41], [42]. These techniques require expensive preprocessing operations.…”
Section: Related Work
confidence: 99%