19th International Conference on High Performance Computing (HiPC 2012)
DOI: 10.1109/hipc.2012.6507483

Sparse matrix-matrix multiplication on modern architectures

Cited by 40 publications (20 citation statements)
References 23 publications

“…Furthermore, MAGMA employs task-based work distribution, in contrast to the symmetric data-parallel approach used in this work. Matam et al. [17] have implemented a hybrid CPU/GPU solver for sparse matrix multiplication. However, they do not scale their solution beyond a single node.…”
Section: A. Related Work
Mentioning confidence: 99%
“…The peak multiplication performance is 16 GFlop/s, and the overall peak performance (multiplication + addition) is 32 GFlop/s. The roof in our setting is 2.3× over our performance, and 9.6× over OuterSPACE's. We are much nearer to the roof than OuterSPACE is.…”
Section: B. Experimental Results
Mentioning confidence: 80%
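
As a sanity check on the arithmetic in this excerpt, the short Python sketch below recomputes the implied throughputs from the stated 32 GFlop/s roof. It assumes the garbled figure in the quote reads as a 2.3× gap; the ~4.2× ratio between the two systems is derived here, not quoted.

```python
# Rough arithmetic check of the roofline claim above, in plain Python.
# Assumptions: the 32 GFlop/s multiply+add peak is the "roof", and the
# garbled "23" in the excerpt is read as a 2.3x gap to that roof.
roof = 32.0               # GFlop/s, multiplication + addition peak
gap_this_work = 2.3       # roof / achieved throughput (this work)
gap_outerspace = 9.6      # roof / achieved throughput (OuterSPACE)

this_work = roof / gap_this_work        # ~13.9 GFlop/s
outerspace = roof / gap_outerspace      # ~3.3 GFlop/s

# Implied speedup of this work over OuterSPACE (a derived value,
# not a number quoted in the excerpt): ~4.2x.
print(f"this work:  {this_work:.1f} GFlop/s")
print(f"OuterSPACE: {outerspace:.1f} GFlop/s")
print(f"implied speedup: {this_work / outerspace:.1f}x")
```
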
“…It achieves good output reuse with pipelined multiply and merge, matrix condensing, and a Huffman tree scheduler, and good input reuse with a row prefetcher. … memory access pattern and poor locality caused by low-density matrices [21], [22], [23]. For instance, the density of Twitter's [24] adjacency matrix is as low as 0.000214%.…”
Section: Introduction
Mentioning confidence: 99%
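
The "multiply and merge" pipeline mentioned in the excerpt follows the outer-product formulation of SpGEMM. The sketch below is a software analogue of that dataflow, not the cited accelerator's hardware: a multiply phase emits partial products for each shared index k, and a merge phase accumulates them. The dict-of-nonzeros representation and function name are illustrative choices.

```python
# Software analogue of outer-product SpGEMM: multiply phase generates
# partial products per shared index k; merge phase accumulates them.
from collections import defaultdict

def outer_product_spgemm(A, B):
    """A, B: dicts mapping (row, col) -> value for the nonzeros."""
    # Index A by column and B by row so each k pairs a column of A
    # with a row of B (one outer product per k).
    a_by_col = defaultdict(list)
    for (i, k), v in A.items():
        a_by_col[k].append((i, v))
    b_by_row = defaultdict(list)
    for (k, j), v in B.items():
        b_by_row[k].append((j, v))

    # Multiply phase: emit partial products a[i,k] * b[k,j].
    # Merge phase: accumulate them into the output (a dict here; the
    # accelerator instead merges sorted partial results on chip).
    C = defaultdict(float)
    for k, a_entries in a_by_col.items():
        for i, av in a_entries:
            for j, bv in b_by_row.get(k, []):
                C[(i, j)] += av * bv
    return dict(C)

# Tiny usage example: a 2x2 sparse product.
A = {(0, 0): 1.0, (1, 1): 2.0}
B = {(0, 1): 3.0, (1, 0): 4.0}
print(outer_product_spgemm(A, B))  # {(0, 1): 3.0, (1, 0): 8.0}
```
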
“…Regular matrices result from problems involving mesh approximations, e.g., from finite element methods, while irregular matrices mostly result from network structures. These matrices were also used for performance tests in [32] and therefore provide a basis for comparison. The matrix mouse280 originates from a finite difference mesh (using a seven-point stencil) that models the diffuse light propagation inside a mouse [21].…”
Section: Performance Measurements
Mentioning confidence: 99%
“…Considering the same example application as in [12,31,32], we measured the time to compute the square of a sparse matrix, C = A·A. To assess the scalability, the performance was measured as a function of the matrix width for three-dimensional Poisson matrices (Figure 5).…”
Section: Matrix Squaring: Performance Comparison
Mentioning confidence: 99%
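
A minimal sketch of this kind of measurement, assuming SciPy stands in for the libraries being compared: it assembles three-dimensional Poisson matrices (the usual seven-point stencil, built here via Kronecker sums) of growing width and times the sparse product C = A·A. The grid sizes and output format are illustrative.

```python
# Time sparse matrix squaring C = A*A for 3D Poisson matrices of
# increasing grid width, in the spirit of the measurement above.
import time
import scipy.sparse as sp

def poisson3d(n):
    """3D Poisson matrix (7-point stencil) on an n x n x n grid."""
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
    I = sp.identity(n, format="csr")
    return (sp.kron(sp.kron(T, I), I)
            + sp.kron(sp.kron(I, T), I)
            + sp.kron(sp.kron(I, I), T)).tocsr()

for n in (10, 20, 40):                 # illustrative grid widths
    A = poisson3d(n)
    t0 = time.perf_counter()
    C = A @ A                          # sparse matrix squaring
    dt = time.perf_counter() - t0
    print(f"n={n:3d}  dim={A.shape[0]:7d}  nnz(C)={C.nnz:9d}  {dt:.3f}s")
```
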