2019
DOI: 10.1016/j.jpdc.2018.07.018

Optimizing sparse tensor times matrix on GPUs

Cited by 31 publications (14 citation statements)
References 25 publications

“…Methods for Tucker decomposition include higher-order SVD (HOSVD) [32], truncated HOSVD [32], Alternating Least Squares (ALS) based methods [66], the popular higher-order orthogonal iteration (HOOI) [33], and Newton-Grassmann optimization [36]. Sparse Tucker arises in two ways: from sparse tensors in applications [83,89,90,127] and from constrained sparse factors. The computational tensor kernel of Tucker decomposition is the Tensor-Times-Matrix operation (TTM), described in Section 4.4.…”
Section: Tensor Methods
confidence: 99%
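
The citation above names the Tensor-Times-Matrix (TTM) operation as the computational kernel of Tucker decomposition. For reference, here is a minimal dense sketch of mode-n TTM via unfolding; the function name `ttm` and the use of NumPy are illustrative assumptions, not the cited paper's GPU implementation.

```python
import numpy as np

def ttm(X, U, mode):
    """Multiply tensor X by matrix U along `mode`: Y = X x_mode U."""
    # Mode-n unfolding: bring `mode` to the front and flatten the remaining modes.
    Xn = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
    Yn = U @ Xn  # (J x I_n) times (I_n x product of the other dimensions)
    new_shape = (U.shape[0],) + tuple(np.delete(X.shape, mode))
    # Fold back and restore the original mode ordering.
    return np.moveaxis(Yn.reshape(new_shape), 0, mode)

X = np.random.rand(4, 5, 6)     # a small dense 3-way tensor
U = np.random.rand(3, 5)        # factor matrix for mode 1
print(ttm(X, U, mode=1).shape)  # (4, 3, 6)
```
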
“…This TTM algorithm directly operates on the input sparse tensor by avoiding tensor transformation. The explanation of Algorithm 5 can be found in [85,90].…”
Section: TTM
confidence: 99%
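
The statement refers to a TTM algorithm that works directly on the sparse input tensor without format transformation (the Algorithm 5 it cites is in [85,90] and is not reproduced here). The following is a hedged sketch of that idea on a COO-format tensor; the function signature and loop structure are assumptions for illustration only.

```python
import numpy as np

def sparse_ttm_coo(indices, values, shape, U, mode):
    """Y = X x_mode U for a COO sparse tensor X, without converting formats.

    indices: (nnz, ndim) integer coordinates of the nonzeros
    values:  (nnz,) nonzero values
    U:       (R, shape[mode]) dense factor matrix
    """
    out_shape = list(shape)
    out_shape[mode] = U.shape[0]
    Y = np.zeros(out_shape)            # output is dense along `mode` (semi-sparse)
    for coord, val in zip(indices, values):
        i_n = coord[mode]
        out_coord = list(coord)
        for r in range(U.shape[0]):    # each nonzero contributes a length-R fiber
            out_coord[mode] = r
            Y[tuple(out_coord)] += val * U[r, i_n]
    return Y

# Example: a 3x3x3 tensor with two nonzeros, multiplied along mode 0.
idx = np.array([[0, 1, 2], [2, 0, 1]])
val = np.array([1.0, 2.0])
U = np.random.rand(4, 3)
print(sparse_ttm_coo(idx, val, (3, 3, 3), U, mode=0).shape)  # (4, 3, 3)
```
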
“…Shaden et al. [41] used the Compressed Sparse Fiber (CSF) structure, which optimizes access efficiency for HOHDST. The Tensor-Times-Matrix chain (TTMc) [42] is a key part of Tucker Decomposition (TD), and TTMc is a data-intensive task. Ma et al. [42] optimized the TTMc operation on GPUs to exploit the GPU's intensive and partitioned computational resources: the threads of a warp (32) are automatically synchronized, and this mechanism is well suited to block-by-block matrix multiplication.…”
Section: Related Studies
confidence: 99%
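
To make the TTMc term concrete, below is a minimal NumPy sketch of a Tensor-Times-Matrix chain (multiplying by the factor matrices in every mode except one, as done inside HOOI). The warp-level GPU mapping described in the citation is not reproduced; this only shows the semantics, and the function name `ttmc` is an assumption.

```python
import numpy as np

def ttmc(X, factors, skip):
    """Multiply X by each factor matrix factors[n] along mode n, skipping mode `skip`."""
    Y = X
    for mode, U in enumerate(factors):
        if mode == skip:
            continue
        # One mode-n TTM step: unfold, multiply, fold back.
        Yn = np.moveaxis(Y, mode, 0).reshape(Y.shape[mode], -1)
        new_shape = (U.shape[0],) + tuple(np.delete(Y.shape, mode))
        Y = np.moveaxis((U @ Yn).reshape(new_shape), 0, mode)
    return Y

X = np.random.rand(4, 5, 6)
factors = [np.random.rand(2, 4), np.random.rand(3, 5), np.random.rand(2, 6)]
print(ttmc(X, factors, skip=1).shape)  # (2, 5, 2)
```
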
“…On the Intel many-core processor KNL (Knights Landing), the computation of CP decomposition is balanced among the processing units, which leads to a 1.8× performance speedup. Li et al. (2016) and Ma et al. (2019) propose an optimized design of sparse tensor-times-dense matrix multiply on GPUs that exploits fine thread granularity, coalesced memory access, rank blocking, and fast shared memory. F-COO (Liu et al. 2017) proposes a unified tensor format along with GPU-specific optimizations that leverage the computation patterns shared across tensor operations.…”
Section: Related Work
confidence: 99%
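
The optimizations listed in this citation (fine thread granularity, coalesced memory access, rank blocking, fast shared memory) are GPU-specific; the sketch below only illustrates the rank-blocking idea on the CPU, processing the output rank R in slabs so each slab of the factor matrix could reside in fast memory. The block size and data layout are assumptions, not the cited implementations.

```python
import numpy as np

def spttm_rank_blocked(indices, values, shape, U, mode, rank_block=16):
    """Sparse-tensor-times-dense-matrix with the R output columns processed in blocks."""
    R = U.shape[0]
    out_shape = list(shape)
    out_shape[mode] = R
    Y = np.zeros(out_shape)
    for r0 in range(0, R, rank_block):   # rank blocking: one slab of U at a time
        r1 = min(r0 + rank_block, R)
        U_blk = U[r0:r1]                 # the slab that would sit in shared memory on a GPU
        for coord, val in zip(indices, values):
            out_coord = list(coord)
            i_n = coord[mode]
            for r in range(r0, r1):
                out_coord[mode] = r
                Y[tuple(out_coord)] += val * U_blk[r - r0, i_n]
    return Y
```
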