2015
DOI: 10.1007/978-3-319-20119-1_4
Parallel Efficient Sparse Matrix-Matrix Multiplication on Multicore Platforms

Cited by 47 publications (23 citation statements)
References 12 publications
“…Problems where B is larger than hbm require partitioning of B. Column-wise partitions have been explored in one-level memory before [29]. However, since our data is stored row-wise, finding column-wise partitions that will fit into hbm is usually prohibitively expensive.…”
Section: Chunking Methods for KNLs
confidence: 99%
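The chunking remark above concerns column-wise partitioning of B so that each partition fits in KNL's high-bandwidth memory while the data itself is stored row-wise. The following C++ sketch is an assumption for illustration only (the Csr struct and extract_column_slice are hypothetical names, not code from the cited papers): extracting one column range from a CSR matrix must visit every stored entry, which is why repeating that scan for each chunk is usually prohibitively expensive.

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CSR container used by the sketches in this report.
struct Csr {
    std::vector<int64_t> rowptr;  // size = number of rows + 1
    std::vector<int64_t> colidx;  // size = number of nonzeros
    std::vector<double>  val;     // size = number of nonzeros
};

// Copy the entries of B whose column index lies in [col_lo, col_hi).
// Because B is stored row-wise, every stored entry is visited for every
// partition, so chunking B column-wise costs O(nnz(B)) per chunk.
Csr extract_column_slice(const Csr& B, int64_t col_lo, int64_t col_hi) {
    Csr slice;
    slice.rowptr.push_back(0);
    for (std::size_t r = 0; r + 1 < B.rowptr.size(); ++r) {
        for (int64_t k = B.rowptr[r]; k < B.rowptr[r + 1]; ++k) {
            if (B.colidx[k] >= col_lo && B.colidx[k] < col_hi) {
                slice.colidx.push_back(B.colidx[k] - col_lo);
                slice.val.push_back(B.val[k]);
            }
        }
        slice.rowptr.push_back(static_cast<int64_t>(slice.colidx.size()));
    }
    return slice;
}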
“…An alternative is another format with random access such as a hash map. These result in slower execution [Patwary et al. 2015], but only use memory proportional to the number of nonzeros.…”
Section: Policy and Choice of Workspace
confidence: 99%
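A minimal sketch of the hash-map workspace mentioned above, reusing the hypothetical Csr struct from the previous sketch; it illustrates the general pattern, not the implementation measured in [Patwary et al. 2015]. Row i of C = A * B is accumulated in an unordered_map, so memory stays proportional to the row's nonzeros at the price of a hashed lookup on every update.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Accumulate row i of C = A * B in a hash map (Csr as defined above).
void spgemm_row_hash(const Csr& A, const Csr& B, int64_t i,
                     std::vector<int64_t>& out_cols,
                     std::vector<double>& out_vals) {
    std::unordered_map<int64_t, double> acc;
    for (int64_t ka = A.rowptr[i]; ka < A.rowptr[i + 1]; ++ka) {
        const int64_t k = A.colidx[ka];            // column of A = row of B
        const double  a = A.val[ka];
        for (int64_t kb = B.rowptr[k]; kb < B.rowptr[k + 1]; ++kb)
            acc[B.colidx[kb]] += a * B.val[kb];    // hashed insert or update
    }
    for (const auto& kv : acc) {                   // unsorted; sort if CSR output is needed
        out_cols.push_back(kv.first);
        out_vals.push_back(kv.second);
    }
}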
“…A workspace used for accumulating temporary values is referred to as an expanded real accumulator in [Pissanetzky 1984] and as an abstract sparse accumulator data structure in [Gilbert et al. 1992]. Dense workspaces and blocking are used to produce fast parallel code by Patwary et al. [Patwary et al. 2015]. They also tried a hash map workspace, but report that it did not have good performance for their use.…”
Section: Related Work
confidence: 99%
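For contrast, the dense workspace (the "abstract sparse accumulator" of Gilbert et al.) can be sketched roughly as below; this is again a hedged illustration built on the hypothetical Csr struct from the first sketch, not the authors' parallel code. A dense value array indexed directly by column gives O(1) accumulation, and only the entries touched by the current row are cleared afterwards.

#include <cstdint>
#include <vector>

// Accumulate row i of C = A * B in a dense workspace (Csr as defined above).
// The workspace is sized to the number of columns of B; only the slots
// touched by this row are reset before the next one.
void spgemm_row_dense(const Csr& A, const Csr& B, int64_t i, int64_t b_ncols,
                      std::vector<int64_t>& out_cols,
                      std::vector<double>& out_vals) {
    static thread_local std::vector<double> work;      // dense values
    static thread_local std::vector<char>   occupied;  // nonzero pattern
    if (static_cast<int64_t>(work.size()) < b_ncols) {
        work.assign(b_ncols, 0.0);
        occupied.assign(b_ncols, 0);
    }
    std::vector<int64_t> touched;                       // columns hit in this row

    for (int64_t ka = A.rowptr[i]; ka < A.rowptr[i + 1]; ++ka) {
        const int64_t k = A.colidx[ka];
        const double  a = A.val[ka];
        for (int64_t kb = B.rowptr[k]; kb < B.rowptr[k + 1]; ++kb) {
            const int64_t j = B.colidx[kb];
            if (!occupied[j]) { occupied[j] = 1; touched.push_back(j); }
            work[j] += a * B.val[kb];                   // O(1) direct-indexed accumulation
        }
    }
    for (int64_t j : touched) {
        out_cols.push_back(j);
        out_vals.push_back(work[j]);
        work[j] = 0.0;                                  // reset only touched slots
        occupied[j] = 0;
    }
}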
“…Optimizing sparse matrix-matrix multiplication is an active area of research [17], [18]; state-of-the-art implementations are bound by the memory bandwidth and heavily underutilize the compute resources.…”
Section: Optimizing Res = A^T AB
confidence: 99%