2017
DOI: 10.48550/arxiv.1710.03740
Preprint

Mixed Precision Training

Cited by 281 publications (240 citation statements)
References 10 publications
“…We train the upsampling model for 1.6M iterations at batch size 512. We find that these models train stably with 16-bit precision and traditional loss scaling (Micikevicius et al., 2017). The total training compute is roughly equal to that used to train DALL-E.…”
Section: Text-conditional Diffusion Models (mentioning)
confidence: 89%
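The "traditional loss scaling" referenced in this statement is the static scheme from the cited paper: keep an FP32 master copy of the weights, run the forward and backward passes in FP16, multiply the loss by a fixed constant so that small gradients survive the FP16 range, and divide the gradients by the same constant before the weight update. A minimal sketch of that pattern, assuming a toy PyTorch model on an available CUDA device; the model, scale factor, and hyperparameters are illustrative only:

```python
# Static ("traditional") loss scaling with FP32 master weights -- a sketch,
# not the training setup of the cited work.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 1).cuda().half()                     # FP16 working copy
master_params = [p.detach().float().clone().requires_grad_(True)
                 for p in model.parameters()]                     # FP32 master weights
optimizer = torch.optim.SGD(master_params, lr=1e-3)
loss_scale = 1024.0                                               # fixed scale factor

for step in range(10):
    x = torch.randn(32, 16, device="cuda", dtype=torch.float16)
    y = torch.randn(32, 1, device="cuda", dtype=torch.float16)
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss * loss_scale).backward()                                # scale loss so small FP16 grads don't underflow
    for p, mp in zip(model.parameters(), master_params):
        mp.grad = p.grad.float() / loss_scale                     # unscale into FP32 master grads
        p.grad = None
    optimizer.step()
    optimizer.zero_grad()
    with torch.no_grad():
        for p, mp in zip(model.parameters(), master_params):
            p.copy_(mp.half())                                    # copy updated weights back to FP16
```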
“…[Table fragment from the citing paper: LibriSpeech [18], Human, 960, 5760; Common Voice [19], Human, 500, 3000; Libri-Light [20], Model, 60000, 360000; Fisher [21], Human, …] Following prior work on scaling Transformer models [1,10,11], we scale the encoder of an E2E VGG-transformer transducer model [12,13] up to 10B parameters. We leverage several techniques to train our transducer models efficiently on GPUs: FairScale model sharding [14], sparse alignment restricted transducer loss [15], mixed-precision training [16], and large batch sizes [17].…”
Section: Data Source (mentioning)
confidence: 99%
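In large training pipelines like the one described above, mixed precision is commonly applied through an automatic mixed-precision API with dynamic loss scaling rather than a hand-picked constant. A hedged PyTorch sketch of that pattern, assuming a CUDA device; the model, batch size, and optimizer are placeholders, not the 10B-parameter transducer from the citing work:

```python
# Automatic mixed precision with dynamic loss scaling -- a generic sketch.
import torch

model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                      # dynamic loss scaling

for step in range(100):
    x = torch.randn(512, 256, device="cuda")              # large batch (placeholder data)
    y = torch.randint(0, 10, (512,), device="cuda")
    with torch.cuda.amp.autocast():                        # mixed FP16/FP32 compute
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                          # scale loss before backward
    scaler.step(optimizer)                                 # unscale grads; skip step on inf/NaN
    scaler.update()                                        # grow/shrink the scale factor
    optimizer.zero_grad(set_to_none=True)
```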
“…It is well known that small floating point error does not dramatically affect the convergence and final accuracy of ML models [16,20,24,72]. This observation has motivated extensive prior research about training with low or mixed-precision FP operations [20,26,47,51,80,120] and compression or quantization [36,40,45,72].…”
Section: Characteristics of Training Gradients (mentioning)
confidence: 99%
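The tolerance to small floating-point error noted in this statement, and the gradient underflow that loss scaling is designed to counter, can be seen directly in NumPy; the gradient value below is chosen purely for illustration:

```python
# Rounding error and underflow in FP16 versus FP32 -- illustrative values only.
import numpy as np

print(np.finfo(np.float16).eps)     # ~9.8e-4: relative rounding error of FP16
print(np.finfo(np.float32).eps)     # ~1.2e-7: relative rounding error of FP32

g = np.float32(1e-8)                # a gradient below FP16's smallest subnormal (~6e-8)
print(np.float16(g))                # -> 0.0: the gradient underflows in FP16
print(np.float16(g * 1024))         # scaling by 1024 keeps it representable (~1.02e-5)
```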
“…It also lacks flexibility: it is tied to specific operations on specific floating-point formats. New ML-specific numeric representations (e.g., FP16 [80,109], bfloat16 [22,31,54], TF32 [87], and MSFP [19]) represent an area of ongoing innovation, and adding support for a new format requires developing and manufacturing a new ASIC, an expensive and time-consuming endeavor. For example, it took four years for Mellanox to release its second version of switches with floating point support [32,33].…”
Section: Introduction (mentioning)
confidence: 99%
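The formats named in this statement trade exponent and mantissa bits differently: FP16 uses a 5-bit exponent and 10-bit mantissa, while bfloat16 keeps FP32's 8-bit exponent with only a 7-bit mantissa. A small PyTorch comparison, with values chosen for illustration (TF32 and MSFP are hardware-internal formats and are omitted):

```python
# Range versus precision in FP16 and bfloat16 -- illustrative values only.
import torch

x = torch.tensor(3.0e5)
print(x.to(torch.float16))      # -> inf: exceeds FP16's ~65504 maximum
print(x.to(torch.bfloat16))     # -> finite, close to 3.0e5 (FP32-sized exponent)

y = torch.tensor(1.01)
print(y.to(torch.float16))      # ~1.0098: 10 mantissa bits give finer steps near 1.0
print(y.to(torch.bfloat16))     # ~1.0078: only 7 mantissa bits, coarser steps
```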