2021
DOI: 10.1007/978-3-030-78713-4_12

COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling

Cited by 4 publications (3 citation statements)
References 17 publications
“…Data distribution. COnfLUX and COnfCHOX provide ScaLAPACK wrappers by using the highly-optimized COSTA algorithm [38] to transform the matrices between different layouts. In addition, they support the COSTA API for matrix descriptors, which is more general than ScaLAPACK's layout, as it supports matrices distributed in arbitrary grid-like layouts, processor assignments, and local blocks orderings.…”
Section: Methods
mentioning
confidence: 99%
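
To make "transforming matrices between different layouts" concrete, here is a minimal single-process sketch in Python. It assumes the standard 1-D block-cyclic rule (global index i belongs to process (i // block) % nprocs, with the first block on process 0) and counts how many elements each pair of processes would have to exchange when the block size changes. The names owner and transfer_counts are hypothetical helpers for illustration; this is not the COSTA API and does not reproduce its communication-optimal relabeling.

```python
from collections import Counter


def owner(i, block, nprocs):
    """Process that owns global index i in a 1-D block-cyclic layout
    (assumption: the first block lives on process 0)."""
    return (i // block) % nprocs


def transfer_counts(n, src_block, dst_block, nprocs):
    """Count the elements each (source, destination) process pair exchanges
    when n elements move from block size src_block to block size dst_block."""
    counts = Counter()
    for i in range(n):
        pair = (owner(i, src_block, nprocs), owner(i, dst_block, nprocs))
        counts[pair] += 1
    return counts


if __name__ == "__main__":
    # Entries with equal source and destination stay local; the rest are sent.
    for pair, count in sorted(transfer_counts(8, 2, 4, 2).items()):
        print(pair, count)
```

Pairs whose source and destination match need no network traffic, so a table like this shows how much of a redistribution is actually remote.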
“…Each device computes the attention matrix for all attention heads but only with the samples in their own partitions. Next, we use COSTA [28] to efficiently shuffle partitions so that each device has the attention scores for all samples but with only 1/N heads. This step prepares for the softmax calculation that requires all examples for each embedding dimension, and it can be computed sequentially by each device.…”
Section: Distributed Attention
mentioning
confidence: 99%
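
The shuffle described in this statement amounts to an all-to-all exchange: data partitioned by sample becomes data partitioned by attention head. The Python sketch below simulates that repartitioning in a single process with nested lists. It is only an illustration under assumptions: the function name shuffle_samples_to_heads and the per_device[d][s][h] layout are hypothetical, the number of heads is assumed to divide evenly across devices, and no COSTA or MPI call is involved.

```python
def shuffle_samples_to_heads(per_device, n_devices):
    """Repartition attention scores from sample-wise to head-wise ownership.

    per_device[d][s][h] is the score held by device d for its local sample s
    and head h (hypothetical layout). The result out[d] lists every sample,
    in device order, restricted to the 1/N slice of heads owned by device d.
    """
    n_heads = len(per_device[0][0])
    assert n_heads % n_devices == 0, "heads must divide evenly across devices"
    heads_per_device = n_heads // n_devices

    out = []
    for dst in range(n_devices):
        lo = dst * heads_per_device
        hi = lo + heads_per_device
        rows = []
        # Each source device contributes its local samples, but only the
        # heads that the destination device owns.
        for src in range(n_devices):
            for sample in per_device[src]:
                rows.append(sample[lo:hi])
        out.append(rows)
    return out


if __name__ == "__main__":
    # 2 devices, 2 local samples each, 4 heads; entries are (device, sample, head).
    data = [[[(d, s, h) for h in range(4)] for s in range(2)] for d in range(2)]
    for d, rows in enumerate(shuffle_samples_to_heads(data, 2)):
        print(d, rows)
```

After the shuffle every device holds all samples for its own heads, which is the precondition the statement gives for computing the softmax locally.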
“…scale, transpose or conjugate the received package S_{i,id} if needed

6 Implementation Details

COSTA (Algorithm 3) is implemented using the hybrid MPI+OpenMP parallelization model. The code is publicly available under the BSD-3 Clause Licence at [11]. It has the following features: 1) provides the ScaLAPACK wrappers for pxgemr2d and pxtran; 2) supports arbitrary grid-like matrix layouts (not limited to block-cyclic).…”
mentioning
confidence: 99%
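
The final step quoted above (scale, transpose or conjugate a received package) can be illustrated on one local block. The Python sketch below is an assumption-laden illustration: the function transform_block and its parameters alpha, transpose, and conjugate are hypothetical names, the block is a plain list of rows, and the code is not taken from the COSTA implementation.

```python
def transform_block(block, alpha=1.0, transpose=False, conjugate=False):
    """Apply the optional post-receive operations to one block.

    block is a list of rows; alpha is a scalar multiplier. Conjugation and
    scaling are applied element-wise, then the block is transposed if asked.
    """
    rows = [
        [alpha * (x.conjugate() if conjugate else x) for x in row]
        for row in block
    ]
    if transpose:
        # zip(*rows) turns columns into rows.
        rows = [list(col) for col in zip(*rows)]
    return rows


if __name__ == "__main__":
    received = [[1 + 2j, 3], [4, 5j]]
    # Conjugate transpose scaled by 2.
    print(transform_block(received, alpha=2, transpose=True, conjugate=True))
```

Combining transpose=True with conjugate=True gives the conjugate transpose, so one routine covers the plain, transposed, and conjugate-transposed cases.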