2023 IEEE International Symposium on Workload Characterization (IISWC)
DOI: 10.1109/iiswc59245.2023.00026
Tale of Two Cs: Computation vs. Communication Scaling for Future Transformers on Future Hardware

Suchita Pati, Shaizeen Aga, Mahzabeen Islam, et al.
Cited by 6 publications (8 citation statements)
References 28 publications
“…For very large models (e.g., PALM, MT-NLG) we consider 32-way slicing, and for futuristic ones with one and ten trillion parameters, we consider 64-way sharding. The increasing TP slicing is necessary because these models' larger sizes cannot fit in 16 GPUs [64] and the increased slicing is also enabled by nodes with larger device counts [59,86]. Like prior work [36,49,64], we find that communication is a considerable fraction of the overall runtime: Megatron-GPT-2 (Mega-GPT-2) and T-NLG spend up to 34% and 43% of their training and inference (prompt phase) time on communication.…”
Section: All-reduce Is on the Critical Path and Can Be Large
Citation type: supporting (confidence: 50%)
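The statement above attributes roughly a third or more of run time to tensor-parallel all-reduce communication. As a rough illustration of why that volume grows with model and node size, the sketch below estimates the per-device bytes moved by the forward-pass all-reduces of one Megatron-style Transformer layer. The function name and the batch, sequence-length, and hidden-size values are illustrative assumptions, not numbers taken from the paper.

```python
# Back-of-envelope sketch (not from the paper) of per-layer all-reduce volume
# under Megatron-style tensor parallelism (TP). All inputs are assumptions.

def allreduce_bytes_per_layer(batch, seq_len, hidden, tp_degree,
                              bytes_per_elem=2, allreduces_per_layer=2):
    """Bytes each device moves for the forward-pass all-reduces of one layer.

    Megatron-style TP issues one all-reduce after the attention block and one
    after the MLP block; a ring all-reduce moves 2*(p-1)/p of the tensor size
    per device.
    """
    activation_bytes = batch * seq_len * hidden * bytes_per_elem
    ring_factor = 2 * (tp_degree - 1) / tp_degree
    return allreduces_per_layer * ring_factor * activation_bytes

# Illustrative example: GPT-3-like hidden size, 2K sequence, 64-way TP.
vol = allreduce_bytes_per_layer(batch=8, seq_len=2048, hidden=12288, tp_degree=64)
print(f"~{vol / 2**30:.1f} GiB moved per device per layer (forward only)")
```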
“…For larger Transformers like PALM [12], GPT-3 [9], and MT-NLG [80], we use a higher slicing degree of 32 given their increasingly large memory capacity requirements [64] and the availability of nodes with larger device counts that enable this slicing [29,59,76]. We evaluate mixed-precision training, which entails half-precision (FP16) forward and backpropagation and single-precision (FP32) weight updates.…”
Section: Applications, Deployment, and GEMMs
Citation type: mentioning (confidence: 99%)
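This statement describes mixed-precision training with FP16 forward and backward passes and FP32 weight updates. A minimal sketch of that recipe using PyTorch's automatic mixed precision follows; the framework choice and the model, optimizer, and tensor shapes are assumptions, not the paper's setup.

```python
# Minimal mixed-precision sketch (assumed PyTorch AMP, not the paper's code):
# FP16 forward/backward with FP32 master weights for the optimizer update.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()          # weights kept in FP32 (master copy)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # loss scaling to avoid FP16 underflow

x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

for _ in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # forward runs in FP16 where safe
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()             # FP16 backward with scaled loss
    scaler.step(optimizer)                    # FP32 weight update on master weights
    scaler.update()
```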