2016 6th Workshop on Irregular Applications: Architecture and Algorithms (IA3)
DOI: 10.1109/ia3.2016.010
Optimizing Sparse Tensor Times Matrix on Multi-core and Many-Core Architectures

Cited by 23 publications (27 citation statements: 0 supporting, 27 mentioning, 0 contrasting), with citing publications from 2017 to 2022. References 20 publications.
“…Figure 1. COO format for a general sparse tensor and sCOO format [41] for a semi-sparse tensor. Table 1 presents the operational intensity of each kernel using a cubical third-order tensor, while all the implementations in the benchmark suite support arbitrary tensor orders.…”
Section: Coordinate Format (COO)
Mentioning; confidence: 99%
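To make the excerpt concrete, here is a minimal sketch of a coordinate (COO) container for a third-order sparse tensor in C. The struct and field names (coo_tensor3, i, j, l, val) are illustrative assumptions, not the ParTI API: each nonzero X(i[k], j[k], l[k]) = val[k] is stored as one entry across parallel index and value arrays.

```c
#include <stdint.h>
#include <stdlib.h>

/* COO storage for a third-order sparse tensor: one index array
 * per mode plus a parallel value array. Entry k records the
 * nonzero X(i[k], j[k], l[k]) = val[k]. */
typedef struct {
    uint32_t *i, *j, *l;  /* mode-1, mode-2, mode-3 indices */
    double   *val;        /* nonzero values */
    size_t    nnz;        /* number of stored nonzeros */
} coo_tensor3;

/* Allocate index and value arrays for nnz nonzeros. */
coo_tensor3 coo_alloc(size_t nnz) {
    coo_tensor3 t;
    t.i   = malloc(nnz * sizeof *t.i);
    t.j   = malloc(nnz * sizeof *t.j);
    t.l   = malloc(nnz * sizeof *t.l);
    t.val = malloc(nnz * sizeof *t.val);
    t.nnz = nnz;
    return t;
}
```

A semi-sparse tensor, which is dense in one mode, can drop the index array for that mode and keep a dense fiber of values per remaining coordinate pair; that is the intuition behind the sCOO variant the excerpt cites.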
“…However, the transformation process brings non-trivial overhead to the execution of a tensor operation. Mitigating this cost has become attractive for researchers in tensor linear algebra and their applications [17,37,41,48,60]. Irregularity in memory access patterns and in tensor shape makes poor use of memory subsystems and complicates code, especially for sparse data.…”
Section: Introduction
Mentioning; confidence: 99%
“…This TTM algorithm directly operates on the input sparse tensor by avoiding tensor transformation. The explanation of Algorithm 5 can be found in the work [85,90].…”
Section: TTM
Mentioning; confidence: 99%
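As a sketch of what "directly operates on the input sparse tensor" means, the mode-3 TTM Y(i,j,:) += X(i,j,l) * U(l,:) can stream over the COO nonzeros with no format conversion. This reuses the hypothetical coo_tensor3 struct from the sketch above and assumes a caller-zeroed dense output buffer; it is not the cited Algorithm 5, which the quoted survey explains in detail.

```c
#include <stddef.h>  /* reuses coo_tensor3 from the sketch above */

/* Mode-3 sparse TTM sketch: Y(i,j,r) = sum_l X(i,j,l) * U(l,r).
 * X is a coo_tensor3; U is a dense L-by-R matrix in row-major
 * order; Y is a dense I*J*R buffer the caller has zeroed. Each
 * nonzero of X is visited exactly once, so no transposition or
 * re-sorting of the tensor is required. */
void spttm_mode3(const coo_tensor3 *X, const double *U,
                 double *Y, size_t J, size_t R) {
    for (size_t k = 0; k < X->nnz; ++k) {
        const double  x = X->val[k];
        const double *u = U + (size_t)X->l[k] * R;                 /* row l of U  */
        double       *y = Y + ((size_t)X->i[k] * J + X->j[k]) * R; /* fiber (i,j) */
        for (size_t r = 0; r < R; ++r)
            y[r] += x * u[r];  /* dense update of the output fiber */
    }
}
```

Note that the result is sparse in modes 1 and 2 but dense in the product mode, exactly the semi-sparse pattern the sCOO format above is designed to hold.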
“…This paper expands on the first independent characterization of the Emu Chick prototype [1] by exploring multiple distributed nodes that consist of those nodelets (see Section 2). Our study uses microbenchmarks and small kernels-namely, STREAM, pointer chasing, and sparse matrix-vector multiplication (SpMV)-as proxies that reflect some of the key characteristics of our motivating computations, which come from sparse and irregular applications [4,5]. Indeed, one larger goal of our work beyond this paper is to develop a performance-portable, Emu-compatible API for Georgia Tech's STINGER open-source streaming graph framework [4] and ParTI [6] tensor decomposition algorithms (e.g.…”
Section: Introduction
Mentioning; confidence: 99%
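For readers unfamiliar with why SpMV serves as a proxy for irregular applications, a textbook CSR kernel (a generic sketch, not the Emu, STINGER, or ParTI code) makes the data-dependent access pattern explicit:

```c
#include <stddef.h>

/* CSR sparse matrix-vector product y = A * x.
 * rowptr[i] .. rowptr[i+1] delimits the nonzeros of row i, and
 * colind[k] gives each nonzero's column. The gather x[colind[k]]
 * is the data-dependent, cache-unfriendly access that makes SpMV
 * representative of sparse and irregular workloads. */
void spmv_csr(size_t nrows, const size_t *rowptr,
              const size_t *colind, const double *val,
              const double *x, double *y) {
    for (size_t i = 0; i < nrows; ++i) {
        double sum = 0.0;
        for (size_t k = rowptr[i]; k < rowptr[i + 1]; ++k)
            sum += val[k] * x[colind[k]];  /* irregular gather */
        y[i] = sum;
    }
}
```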