Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2021
DOI: 10.1145/3458817.3476209
Efficient large-scale language model training on GPU clusters using Megatron-LM

Cited by 202 publications (191 citation statements). References 3 publications.
“…Religion Analysis: For religion, we perform an analysis of the co-occurring words similar to gender and ethnicity. The three prompts used for religion are "The {religious practitioners} are", "The {religious practitio…
Atheism: belief (20), think (40), science (43), lack (53), reason (54), preach (62), existence (63), thinking (76), angry (80), human (81)
Buddhism: compassion (13), mindfulness (15), Buddha (17), monk (21), mind (23), robes (24), calm (30), peaceful (32), living (44), chanting (46)
Christianity: Christ (16), Jesus (17), bible (34), told (45), forced (69), families (73), giving (74), charity (77), poor (82), churches (86)
Hinduism: yoga (11), India (14), tolerance (23), caste (44), traditions (46), Indian (50), system (59), husband (60), skin (68), respect (72)
Islam: hijab (11), modesty (27), prophet (34), law (35), cover (47), Allah (55), face (57), mosque (59), countries (65), veil (67)
Judaism: Jewish (8), white…”
Section: Male Identifiers (mentioning)
confidence: 99%
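A minimal sketch (not the cited study's code) of how such a co-occurrence analysis can be run: sample completions for each religion prompt, count non-stopword tokens, and rank them by frequency, which yields "word (rank)" pairs like those listed above. The function sample_completions is a hypothetical placeholder for whatever language model produced the completions.

from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "that", "they"}

def sample_completions(prompt: str, n: int = 100) -> list[str]:
    """Hypothetical stand-in for drawing n completions from a language model."""
    raise NotImplementedError

def cooccurrence_ranking(group: str, templates: list[str], n: int = 100) -> dict[str, int]:
    """Map each non-stopword to its frequency rank across completions for one group."""
    counts = Counter()
    for template in templates:
        for text in sample_completions(template.format(group), n=n):
            words = re.findall(r"[a-z']+", text.lower())
            counts.update(w for w in words if w not in STOPWORDS)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# Usage sketch: ranks = cooccurrence_ranking("Buddhists", ["The {} are"]); ranks.get("compassion")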
“…Training MT-NLG was made feasible by numerous innovations and breakthroughs along all AI axes. Through a collaboration between NVIDIA Megatron-LM [63,43] and Microsoft DeepSpeed [57,65], we created an efficient and scalable 3D parallel system capable of combining data, pipeline, and tensor-slicing based parallelism. By combining tensor-slicing and pipeline parallelism, we can operate within the regime where they are most effective.…”
Section: Introduction (mentioning)
confidence: 99%
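The 3D decomposition described in this excerpt can be illustrated by how a flat GPU rank maps onto tensor-, pipeline-, and data-parallel coordinates. The sketch below assumes one particular rank-ordering convention chosen for illustration; it is not Megatron-LM's or DeepSpeed's actual API.

def rank_to_3d_coords(rank: int, tensor_parallel: int, pipeline_parallel: int, world_size: int):
    """Map a flat GPU rank to (tensor, pipeline, data) coordinates in a 3D-parallel grid."""
    assert world_size % (tensor_parallel * pipeline_parallel) == 0
    data_parallel = world_size // (tensor_parallel * pipeline_parallel)
    # Assumed convention: tensor-parallel ranks are adjacent, then pipeline stages,
    # then data-parallel replicas.
    tp = rank % tensor_parallel
    pp = (rank // tensor_parallel) % pipeline_parallel
    dp = rank // (tensor_parallel * pipeline_parallel)
    return {"tensor": tp, "pipeline": pp, "data": dp, "data_parallel_size": data_parallel}

# Example: 16 GPUs with tensor_parallel=2 and pipeline_parallel=4 leave a data-parallel
# degree of 2; rank 5 lands at tensor=1, pipeline=2, data=0.
print(rank_to_3d_coords(5, tensor_parallel=2, pipeline_parallel=4, world_size=16))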
“…However, unlike CoCoNet, PyTorch's DDP requires extra memory for overlapping, which can increase training time for very large models [9], and it does not support slicing of the optimizer parameter update, which significantly decreases memory usage. GPipe [26], PipeDream [38], and Narayanan et al. [39] proposed pipeline training to improve model parallelism, dividing each mini-batch into several micro-batches whose forward and backward passes are then pipelined across devices. vPipe [53] improves on these works by providing higher GPU utilization.…”
Section: Related Work (mentioning)
confidence: 99%
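The micro-batch pipelining idea in the excerpt above can be sketched as a simple fill-and-drain (GPipe-style) schedule: forward passes of the micro-batches flow through the stages in order, then backward passes drain in reverse stage order. This is an illustrative simulation of the schedule only, not the implementation of GPipe, PipeDream, or the other cited systems.

def gpipe_schedule(num_stages: int, num_microbatches: int):
    """Return, per time step, the (stage, phase, microbatch) work items of a fill-drain schedule."""
    timeline = []
    # Forward fill: stage s runs micro-batch m's forward pass at time t = s + m.
    for t in range(num_stages + num_microbatches - 1):
        timeline.append([(s, "F", t - s) for s in range(num_stages)
                         if 0 <= t - s < num_microbatches])
    # Backward drain: the last stage starts, earlier stages follow in reverse order.
    for t in range(num_stages + num_microbatches - 1):
        timeline.append([(s, "B", t - (num_stages - 1 - s)) for s in range(num_stages)
                         if 0 <= t - (num_stages - 1 - s) < num_microbatches])
    return timeline

# With 4 stages and 8 micro-batches, the "bubble" is the idle time during fill and
# drain; it shrinks relative to useful work as the number of micro-batches grows.
for t, step in enumerate(gpipe_schedule(4, 8)):
    print(t, step)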
“…Transformer models Vaswani et al. [2017] have attracted increasing interest and shown excellent performance in domains such as natural language processing (NLP) Vaswani et al. [2017], Devlin et al. [2019], Radford et al. [2019], vision Dosovitskiy et al. [2021], and graphs Ying et al. [2021], Yun et al. [2019]. Yet, their typically very high complexity (up to billions of parameters Narayanan et al. [2021]) makes these models notoriously opaque and their predictions inaccessible to the user. Since Transformer models have heavy application in potentially sensitive domains, e.g.…”
Section: Introduction (mentioning)
confidence: 99%
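To make the "billions of parameters" scale cited above concrete, here is a rough back-of-the-envelope count for a GPT-style Transformer, keeping only the dominant terms (attention and MLP weights per layer plus the embedding matrix). The example configuration is illustrative, not taken from the excerpt.

def approx_transformer_params(num_layers: int, hidden_size: int, vocab_size: int) -> int:
    """Approximate parameter count for a GPT-style Transformer (dominant terms only)."""
    per_layer = 12 * hidden_size ** 2        # QKV + output projection (~4h^2) and 4h-wide MLP (~8h^2)
    embeddings = vocab_size * hidden_size    # token embedding (often tied with the output layer)
    return num_layers * per_layer + embeddings

# Example: a GPT-3-scale configuration (96 layers, hidden size 12288, ~50k vocabulary)
# comes out around 1.75e11 parameters, i.e. on the order of 175 billion.
print(approx_transformer_params(96, 12288, 50257))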