2019
DOI: 10.1109/tpds.2019.2913833
Fast Deep Neural Network Training on Distributed Systems and Cloud TPUs

Cited by 52 publications (21 citation statements)
References 15 publications
“…When the batch size increases, the hardware utilization also increases, and the number of iterations for training decreases, so the training time is accelerated [19]. However, a large batch size reduces accuracy so it should be mitigated [3].…”
Section: A. Training Large-Scale DNN Models (mentioning)
confidence: 99%
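The statement above contrasts the speed benefit of large batches (fewer iterations per epoch) with the accuracy penalty they incur. As a rough illustration, not taken from the cited papers, the sketch below shows the iteration-count arithmetic and one widely used mitigation, linear learning-rate scaling with warmup; the dataset size, base batch size, and base learning rate are assumed example values.

```python
# Minimal sketch (not from the cited papers): a larger batch size cuts the
# number of iterations per epoch, and linear learning-rate scaling with
# warmup is one common mitigation for the accuracy loss of large batches.
# All base values below are illustrative assumptions, not reported numbers.

DATASET_SIZE = 1_281_167   # e.g. ImageNet-1k training images (assumed example)
BASE_BATCH = 256           # reference batch size (assumed)
BASE_LR = 0.1              # reference learning rate (assumed)

def iterations_per_epoch(batch_size: int) -> int:
    """Fewer, larger steps per epoch as the batch grows."""
    return -(-DATASET_SIZE // batch_size)  # ceiling division

def scaled_lr(batch_size: int, step: int, warmup_steps: int = 500) -> float:
    """Linear LR scaling (lr proportional to batch size) with a linear
    warmup ramp, a common way to soften the large-batch accuracy drop."""
    target = BASE_LR * batch_size / BASE_BATCH
    if step < warmup_steps:
        return target * (step + 1) / warmup_steps
    return target

if __name__ == "__main__":
    for bs in (256, 4096, 32768):
        print(bs, iterations_per_epoch(bs), scaled_lr(bs, step=10_000))
```

The point of the arithmetic is simply that scaling the batch from 256 to 32768 shrinks the steps per epoch by roughly 128x, which is where the wall-clock speedup comes from, while the learning-rate schedule is what has to absorb the statistical cost.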
“…We use 1e-5 as the weight for L2 regularization. We train with a batch size of 4096, using a dropout of 0.3 on 32 TPU (You et al., 2019) cores.…”
Section: Training Details (mentioning)
confidence: 99%
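To see the quoted hyperparameters in one place, here is a minimal, hypothetical config sketch; the dictionary keys and the data-parallel sharding arithmetic are illustrative assumptions, not code from the citing paper.

```python
# Hypothetical config collecting the hyperparameters quoted above.
# Key names and the per-core batch arithmetic are assumptions for
# illustration, not the citing paper's actual code.
config = {
    "l2_weight": 1e-5,          # weight for L2 regularization
    "global_batch_size": 4096,  # global batch size across all cores
    "dropout_rate": 0.3,
    "num_tpu_cores": 32,
}

# Under simple data parallelism, each TPU core processes an equal shard
# of the global batch:
per_core_batch = config["global_batch_size"] // config["num_tpu_cores"]
print(f"per-core batch size: {per_core_batch}")  # -> 128
```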
“…In July 2018, Google announced edge TPUs designed for neural network inference and training on edge computing [39]. They deliver high performance within small physical and power limitations.…”
Section: Edge TPUs (mentioning)
confidence: 99%