2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/ipdps.2016.113
A Medium-Grained Algorithm for Sparse Tensor Factorization

Cited by 49 publications (62 citation statements); references 18 publications.
“…In particular, Liavas et al [6] extend a parallel algorithm designed for sparse tensors [25] to the 3D dense case. They use the "medium-grained" dense tensor distribution and rowwise factor matrix distribution, which is exactly the same as our distribution strategy (see section IV-C2), and they use a Nesterov-based algorithm to enforce the nonnegativity constraints.…”
Section: Related Work
confidence: 99%
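As a rough illustration of the medium-grained idea described above (a 3D processor grid that blocks each mode's index range, with factor matrices split rowwise), here is a minimal sketch; the function name, grid shape, and blocking rule are our own illustrative assumptions, not taken from the cited implementations.

```python
# Hypothetical sketch of a "medium-grained" owner map: a p1 x p2 x p3
# processor grid partitions the tensor's index space by blocking each
# mode's index range. Illustrative only; not from the cited papers.

def medium_grained_owner(i, j, k, dims, grid):
    """Map tensor entry (i, j, k) to its owning process coordinates
    on a p1 x p2 x p3 grid by blocking each mode's index range."""
    (I, J, K), (p1, p2, p3) = dims, grid
    bi = min(i * p1 // I, p1 - 1)   # block along mode 1
    bj = min(j * p2 // J, p2 - 1)   # block along mode 2
    bk = min(k * p3 // K, p3 - 1)   # block along mode 3
    return (bi, bj, bk)

# Example: a 6x6x6 tensor on a 2x3x1 grid.
print(medium_grained_owner(0, 0, 0, (6, 6, 6), (2, 3, 1)))  # (0, 0, 0)
print(medium_grained_owner(5, 4, 5, (6, 6, 6), (2, 3, 1)))  # (1, 2, 0)
```

Each nonzero (for sparse tensors) or block (for dense tensors) is then stored by its owner, and factor-matrix rows are communicated only within grid slices that share an index range.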
“…Prior work has also studied the Tucker decomposition on the MapReduce platform [11]. Other tensor decompositions such as CP factorization have been explored as well (e.g., [13,16,12,25,14]).…”
Section: Procedure
confidence: 99%
“…In this section, we discuss prior schemes proposed in the context of Tucker decomposition [15], as well as the related CP decomposition [25]. The schemes can be categorized into three types.…”
Section: Prior Distribution Schemes
confidence: 99%
“…With regard to tensor factorization, designing high-performance implementations of CP-ALS, as well as measuring their performance, is an active area of research [22]. There have been efforts to perform tensor factorization on both shared- and distributed-memory systems [23], [24], [25], as well as on GPUs [26], [27]. However, to the best of our knowledge, ReFacTo is the only current implementation of CP-ALS that runs on multiple GPUs in a distributed fashion and is able to utilize GPU communication hardware and software.…”
Section: Related Work
confidence: 99%
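The citation statements above all revolve around parallelizing CP-ALS. For context, a minimal serial NumPy sketch of the alternating least-squares updates that these distributed implementations parallelize is shown below; it is a toy dense version under our own naming, not code from any of the cited systems.

```python
# Minimal dense CP-ALS sketch (serial, toy-scale). The distributed
# implementations cited above parallelize exactly these MTTKRP-based
# factor updates across processes or GPUs.
import numpy as np

def khatri_rao(A, B):
    """Column-wise Khatri-Rao product of A (I x R) and B (J x R) -> (I*J x R)."""
    R = A.shape[1]
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, R)

def cp_als(X, rank, iters=30, seed=0):
    """Rank-`rank` CP decomposition of a 3-way array X via ALS."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    X1 = X.reshape(I, J * K)                     # mode-1 unfolding
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)  # mode-2 unfolding
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)  # mode-3 unfolding
    for _ in range(iters):
        # Each update solves a linear least-squares problem; the Gram
        # matrices (B.T @ B) * (C.T @ C) etc. are small (rank x rank).
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C
```

In a distributed setting, the tensor unfoldings are partitioned across the process grid and the factor-matrix updates require communicating only the rows shared between neighboring grid slices, which is the communication pattern the medium-grained distribution is designed to minimize.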