2023 IEEE International Symposium on Workload Characterization (IISWC)
DOI: 10.1109/iiswc59245.2023.00026
Tale of Two Cs: Computation vs. Communication Scaling for Future Transformers on Future Hardware

Suchita Pati, Shaizeen Aga, Mahzabeen Islam, et al.
Cited by 6 publications (8 citation statements)
References 28 publications
“…For very large models (e.g., PALM, MT-NLG) we consider 32-way slicing, and for futuristic ones with one and ten trillion parameters, we consider 64-way sharding. The increasing TP slicing is necessary because these models' larger sizes cannot fit in 16 GPUs [64] and the increased slicing is also enabled by nodes with larger device counts [59,86]. Like prior work [36,49,64], we find that communication is a considerable fraction of the overall runtime: Megatron-GPT-2 (Mega-GPT-2) and T-NLG spend up to 34% and 43% of their training and inference (prompt phase) time on communication.…”
Section: All-reduce Is on the Critical Path and Can Be Large
Citation type: supporting (confidence: 50%)
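The statement above attributes roughly a third or more of run time to tensor-parallel all-reduce communication. As a rough illustration of why that volume grows with model and node size, the sketch below estimates the per-device bytes moved by the forward-pass all-reduces of one Megatron-style Transformer layer. The function name and the batch, sequence-length, and hidden-size values are illustrative assumptions, not numbers taken from the paper.

```python
# Back-of-envelope sketch (not from the paper) of per-layer all-reduce volume
# under Megatron-style tensor parallelism (TP). All inputs are assumptions.

def allreduce_bytes_per_layer(batch, seq_len, hidden, tp_degree,
                              bytes_per_elem=2, allreduces_per_layer=2):
    """Bytes each device moves for the forward-pass all-reduces of one layer.

    Megatron-style TP issues one all-reduce after the attention block and one
    after the MLP block; a ring all-reduce moves 2*(p-1)/p of the tensor size
    per device.
    """
    activation_bytes = batch * seq_len * hidden * bytes_per_elem
    ring_factor = 2 * (tp_degree - 1) / tp_degree
    return allreduces_per_layer * ring_factor * activation_bytes

# Illustrative example: GPT-3-like hidden size, 2K sequence, 64-way TP.
vol = allreduce_bytes_per_layer(batch=8, seq_len=2048, hidden=12288, tp_degree=64)
print(f"~{vol / 2**30:.1f} GiB moved per device per layer (forward only)")
```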
“…For larger Transformers like PALM [12], GPT-3 [9], and MT-NLG [80], we use a higher slicing degree of 32 given their increasingly large memory capacity requirements [64] and the availability of nodes with larger device counts that enable this slicing [29,59,76]. We evaluate mixed-precision training, which entails half-precision (FP16) forward and backpropagation and single-precision (FP32) weight updates.…”
Section: Applications, Deployment, and GEMMs
Citation type: mentioning (confidence: 99%)
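This statement describes mixed-precision training with FP16 forward and backward passes and FP32 weight updates. A minimal sketch of that recipe using PyTorch's automatic mixed precision follows; the framework choice and the model, optimizer, and tensor shapes are assumptions, not the paper's setup.

```python
# Minimal mixed-precision sketch (assumed PyTorch AMP, not the paper's code):
# FP16 forward/backward with FP32 master weights for the optimizer update.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()          # weights kept in FP32 (master copy)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # loss scaling to avoid FP16 underflow

x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

for _ in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # forward runs in FP16 where safe
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()             # FP16 backward with scaled loss
    scaler.step(optimizer)                    # FP32 weight update on master weights
    scaler.update()
```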