2016
DOI: 10.1137/15m104253x
Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication

Abstract: Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdős–Rényi matrices, those algorithms had not been implemented in practice and their complexities had not been anal…

Cited by 90 publications (90 citation statements) | References 38 publications
“…value ← value ⊕ A^T(i, j) ⊗ v(j). Our column-based masked matvec follows Gustavson's algorithm for SpGEMM (sparse matrix-sparse matrix multiplication), but specialized to matvec [19]. The key challenge in parallelizing Gustavson's algorithm is solving the multiway merge problem [1]. For the GPU, our parallelization approach follows the scan-gather-sort approach outlined by Yang et al. [32] and is shown in Algorithm 3.…”
Section: Row-based Masked Matvec (Pull Phase) | mentioning, confidence: 99%
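As a concrete illustration of the accumulation rule quoted above, here is a minimal serial sketch of a masked matvec over a generic (⊕, ⊗) semiring in the Gustavson style. The CSR-of-A^T layout, the function names, and the default plus-times semiring are illustrative assumptions; this is not the cited GPU kernel, which uses the scan-gather-sort scheme of Yang et al.

```python
# Minimal serial sketch (assumed names/layout, not the cited GPU kernel): a masked
# matvec over a generic (add ⊕, mul ⊗) semiring, computing
#   y[i] = ⊕_j A^T(i, j) ⊗ v[j]   only for entries i selected by the mask.
# A^T is stored in CSR (equivalently, A in CSC); the plus-times defaults are
# placeholders for whichever semiring the application supplies.

def masked_matvec(at_indptr, at_indices, at_data, v, mask,
                  add=lambda x, y: x + y, mul=lambda x, y: x * y, identity=0.0):
    n = len(at_indptr) - 1
    y = [identity] * n
    for i in range(n):
        if not mask[i]:                      # skip entries excluded by the mask
            continue
        acc = identity
        for k in range(at_indptr[i], at_indptr[i + 1]):
            # value ← value ⊕ A^T(i, j) ⊗ v(j)
            acc = add(acc, mul(at_data[k], v[at_indices[k]]))
        y[i] = acc
    return y

# Tiny usage example: 3x3 A^T with 4 nonzeros, mask selecting entries 0 and 2.
at_indptr, at_indices, at_data = [0, 2, 3, 4], [0, 2, 1, 0], [1.0, 2.0, 3.0, 4.0]
print(masked_matvec(at_indptr, at_indices, at_data, v=[1.0, 1.0, 1.0], mask=[1, 0, 1]))
# -> [3.0, 0.0, 4.0]
```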
“…First, we show a lightweight thread-scheduling scheme with load balancing for SpGEMM. Next, we show optimization schemes for hash-table-based SpGEMM, which was proposed for GPUs [25], and for heap-based shared-memory SpGEMM algorithms [3]. Additionally, we extend the Hash SpGEMM by utilizing the vector registers of Intel Xeon or Xeon Phi.…”
Section: Architecture-Specific Optimization of SpGEMM | mentioning, confidence: 99%
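To make the hash-accumulator idea concrete, below is a small serial sketch of Gustavson-style SpGEMM that gathers each output row in a hash table. A Python dict stands in for the fixed-size hash tables the GPU and CPU kernels in the quote rely on; all names are illustrative assumptions, not the paper's API.

```python
# Hedged sketch: hash-accumulator SpGEMM on CSR inputs, one dict per output row.

def spgemm_hash(a_indptr, a_indices, a_data, b_indptr, b_indices, b_data):
    """Return C = A * B in CSR form, accumulating each row of C in a hash table."""
    c_indptr, c_indices, c_data = [0], [], []
    n_rows = len(a_indptr) - 1
    for i in range(n_rows):
        acc = {}                                   # column index -> partial value
        for ka in range(a_indptr[i], a_indptr[i + 1]):
            k, a_ik = a_indices[ka], a_data[ka]
            for kb in range(b_indptr[k], b_indptr[k + 1]):
                j = b_indices[kb]
                acc[j] = acc.get(j, 0.0) + a_ik * b_data[kb]
        for j in sorted(acc):                      # emit row i with sorted columns
            c_indices.append(j)
            c_data.append(acc[j])
        c_indptr.append(len(c_indices))
    return c_indptr, c_indices, c_data
```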
“…In another variant of SpGEMM [3], we use a priority queue (heap), indexed by column indices, to accumulate each row of C. To construct c_i*, a heap of size nnz(a_i*) is allocated. For every nonzero a_ik, the first nonzero entry in b_k*, along with its column index, is inserted into the heap.…”
Section: Heap SpGEMM | mentioning, confidence: 99%
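The quoted description maps directly onto a multiway merge. Below is a sketch of forming one row c_i* with a min-heap keyed on column index, assuming CSR inputs and plus-times arithmetic; the names are illustrative, not the cited code.

```python
# Hedged sketch of the heap-based accumulator: row c_i* is a multiway merge of the
# rows b_k* selected by the nonzeros a_ik, driven by a min-heap on column index.
import heapq

def heap_spgemm_row(i, a_indptr, a_indices, a_data, b_indptr, b_indices, b_data):
    """Return row i of C = A * B as (columns, values) lists."""
    heap = []
    # Seed the heap with the first nonzero of b_k* for every nonzero a_ik,
    # so the heap holds at most nnz(a_i*) entries.
    for ka in range(a_indptr[i], a_indptr[i + 1]):
        k = a_indices[ka]
        kb = b_indptr[k]
        if kb < b_indptr[k + 1]:
            heapq.heappush(heap, (b_indices[kb], ka, kb))
    cols, vals = [], []
    while heap:
        j, ka, kb = heapq.heappop(heap)            # smallest remaining column index
        contrib = a_data[ka] * b_data[kb]
        if cols and cols[-1] == j:                 # same column: accumulate
            vals[-1] += contrib
        else:                                      # new column of c_i*
            cols.append(j)
            vals.append(contrib)
        kb += 1                                    # advance within the same b_k*
        if kb < b_indptr[a_indices[ka] + 1]:
            heapq.heappush(heap, (b_indices[kb], ka, kb))
    return cols, vals

# Example: row 0 of a 1x2 A times a 2x2 B (both CSR).
a = ([0, 2], [0, 1], [1.0, 2.0])                  # a_00 = 1, a_01 = 2
b = ([0, 1, 3], [1, 0, 1], [5.0, 6.0, 7.0])
print(heap_spgemm_row(0, *a, *b))                 # -> ([0, 1], [12.0, 19.0])
```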
“…In this way, load balance is achieved. Linear weak scaling efficiency is possible if, instead of the two-dimensional process grid used in Cannon's algorithm and SUMMA, a three-dimensional process grid is used, as in [3,2,11]. However, because of the random permutation of matrix rows and columns, the ability to exploit the nonzero structure to avoid data movement or communication is lost.…”
mentioning, confidence: 99%
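For reference, here is a tiny sketch of the rank layout such 3D (2.5D) formulations assume: p processes arranged as a c-layer stack of sqrt(p/c) x sqrt(p/c) grids, with c = 1 recovering the 2D SUMMA/Cannon layout. The rank-to-coordinate convention below is one plausible choice, not the one used in the cited implementations.

```python
# Hedged sketch: map a flat MPI-style rank onto a sqrt(p/c) x sqrt(p/c) x c grid.
# Each layer owns a slice of the inner dimension of A*B, and the per-layer partial
# products are summed at the end (the reduction itself is omitted here).
import math

def grid_coords(rank, p, c):
    """Return (row, col, layer) for a rank on a 3D process grid of p = c * side^2."""
    side = int(math.isqrt(p // c))
    assert side * side * c == p, "p must equal c * side^2"
    layer, within = divmod(rank, side * side)
    row, col = divmod(within, side)
    return row, col, layer

# Example: 32 processes as a 4 x 4 x 2 grid.
print([grid_coords(r, p=32, c=2) for r in range(4)])
# -> [(0, 0, 0), (0, 1, 0), (0, 2, 0), (0, 3, 0)]
```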