Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms 2015
DOI: 10.1145/2833179.2833186

Scalable task-based algorithm for multiplication of block-rank-sparse matrices

Abstract: A task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC). The novel features of our formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and (2) fine-grained task-based composition. These features make it tolerant of the load imbalance due to the irregular matrix structure and elimi…
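A minimal Python sketch of the two ideas named in the abstract, not the paper's implementation: a tile-sparse matrix is stored as a dictionary of non-zero blocks, and every per-tile multiply-accumulate is submitted as an independent fine-grained task, so work from different contracted tile indices can overlap rather than proceed in lock step. The tile size, sparsity pattern, and thread-pool executor are illustrative assumptions.

```python
# Illustrative sketch of fine-grained, task-based block-sparse matrix multiplication.
# Not the paper's (distributed-memory) code; a single-node analogy of the idea.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import numpy as np

TILE = 64  # tile (block) edge length -- an assumption for this sketch

def random_block_sparse(nrow_tiles, ncol_tiles, fill=0.3, rng=np.random.default_rng(0)):
    """Return {(i, j): dense tile} keeping only a random subset of tiles."""
    return {(i, j): rng.standard_normal((TILE, TILE))
            for i in range(nrow_tiles) for j in range(ncol_tiles)
            if rng.random() < fill}

def block_sparse_gemm(A, B, workers=4):
    """C[i, j] += A[i, k] @ B[k, j], with one task per contributing (i, k, j) triple."""
    C = defaultdict(lambda: np.zeros((TILE, TILE)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        for (i, k), a in A.items():
            for (k2, j), b in B.items():
                if k == k2:                        # only tiles that actually meet
                    futures.append(((i, j), pool.submit(np.matmul, a, b)))
        for (i, j), fut in futures:                # reduce task results into C
            C[(i, j)] += fut.result()
    return dict(C)

A = random_block_sparse(4, 6)
B = random_block_sparse(6, 5)
C = block_sparse_gemm(A, B)
print(f"{len(C)} non-zero result tiles")
```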

Cited by 32 publications (38 citation statements) · References 35 publications
“…The idea of DF reconstruction in batches of course is not novel; what is novel is how batched DF reconstruction is implemented in the context of a distributed-memory contraction W^{ab}_{cd} τ^{cd}_{ij}. This contraction is implemented as a matrix multiplication using an asynchronous (task-based) 2-dimensional block-sparse scalable universal matrix multiplication algorithm (SUMMA) described elsewhere. The standard formulation of SUMMA, in which the result matrix is stationary, evaluates contributions to the result from each index cd (since in TiledArray index spaces are tiled, the outer loop of SUMMA is over tiles of cd).…”
Section: Methods (mentioning, confidence: 99%)
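A minimal serial sketch, not the TiledArray/SUMMA code, of the stationary-result structure this excerpt describes for T^{ab}_{ij} = Σ_{cd} W^{ab}_{cd} τ^{cd}_{ij}: the outer loop runs over tiles of the contracted index cd, and each iteration reconstructs only that cd panel of W from density-fitting (DF) factors before accumulating into the stationary result. The dimensions and DF factors below are made-up placeholders.

```python
# Stationary-result accumulation over tiles of the contracted index cd,
# with batched DF reconstruction of each W panel inside the loop.
import numpy as np

n_ab, n_ij, n_cd, n_aux = 50, 40, 60, 30   # flattened composite indices (assumed)
tile_cd = 15                                # tile size of the contracted index cd

rng = np.random.default_rng(1)
L_ab = rng.standard_normal((n_ab, n_aux))   # DF factors: W[ab, cd] ~ L_ab @ L_cd.T
L_cd = rng.standard_normal((n_cd, n_aux))
tau = rng.standard_normal((n_cd, n_ij))

T = np.zeros((n_ab, n_ij))                  # stationary result
for lo in range(0, n_cd, tile_cd):          # outer loop over tiles of cd
    hi = min(lo + tile_cd, n_cd)
    W_panel = L_ab @ L_cd[lo:hi].T          # batched DF reconstruction of one cd panel
    T += W_panel @ tau[lo:hi]               # this tile's contribution to T

# Check against the unbatched contraction.
assert np.allclose(T, (L_ab @ L_cd.T) @ tau)
print("stationary-result accumulation matches:", T.shape)
```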
“…In tensor contractions, data locality is exploited such that MPI_Raccumulate is intra-node while MPI_Rget can be inter-node; we made this decision because MPI_Raccumulate, unlike MPI_Rget and MPI_Rput, is typically not implemented at the hardware level. The index permutation of tensors is currently performed at the destination; further optimization using a scalable universal matrix multiplication algorithm (SUMMA) [29,30] to avoid the repeated permutation operations will be performed in the future.…”
Section: F. Code Generator and Parallelization (mentioning, confidence: 99%)
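A minimal mpi4py sketch, an assumption-laden illustration rather than the cited code, of the locality rule in this excerpt: issue MPI_Raccumulate only toward ranks on the same node, and MPI_Rget toward ranks on other nodes, since the accumulate path is often not hardware-offloaded. Run with an MPI launcher, e.g. `mpiexec -n 4 python sketch.py`.

```python
# Locality-aware choice between one-sided accumulate (intra-node) and get (inter-node).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Every rank exposes one local tile of the distributed tensor through an RMA window.
tile = np.full(8, float(rank))
win = MPI.Win.Create(tile, comm=comm)

# Host names tell us which world ranks share a node with us.
hostnames = comm.allgather(MPI.Get_processor_name())

target = (rank + 1) % size          # rank owning the tile we contribute to / read from
contribution = np.ones(8)           # data to accumulate into the target tile
fetched = np.empty(8)               # buffer for a remote read

win.Lock_all()
if hostnames[target] == hostnames[rank]:
    # Intra-node: accumulate our contribution directly into the remote tile.
    req = win.Raccumulate(contribution, target, op=MPI.SUM)
else:
    # Inter-node: only fetch the remote tile; accumulation is done locally afterwards.
    req = win.Rget(fetched, target)
req.Wait()
win.Unlock_all()
win.Free()
```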
“…[114] Several computer libraries capable of efficiently evaluating the final sequence of binary contractions have recently become available [70-73]. We here use a prototyping library developed by one of us, which will be described elsewhere.…”
Section: Wick's Theorem and Tensor Contractions (mentioning, confidence: 99%)
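A small illustration, not the prototyping library mentioned in the excerpt, of evaluating a multi-tensor contraction as a sequence of binary (pairwise) contractions: R_{il} = Σ_{jk} A_{ij} B_{jk} C_{kl} computed as two pairwise products.

```python
# A three-tensor contraction evaluated as a sequence of two binary contractions.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 7))
B = rng.standard_normal((7, 8))
C = rng.standard_normal((8, 9))

AB = np.einsum("ij,jk->ik", A, B)        # first binary contraction
R = np.einsum("ik,kl->il", AB, C)        # second binary contraction

# Same result in a single (non-binary) contraction, for comparison.
assert np.allclose(R, np.einsum("ij,jk,kl->il", A, B, C))
```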