Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

Shi, Shaohuai; Wang, Qiang; Chu, Xiaowen

doi:10.1109/dasc/picom/datacom/cyberscitec.2018.000-4

Cited by 90 publications

(61 citation statements)

References 26 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Other models limit the degree of communication to less frequent synchronization points while allowing the individual models to temporarily diverge. Gossip Learning [139] is built around the idea that models are mobile and perform independent random walks through the peer-to-peer network. Since this forms a data-and model-parallel processing framework, the models evolve dierently and need to be combined through ensembling.…”

Section: Topologiesmentioning

confidence: 99%

See 1 more Smart Citation

A Survey on Distributed Machine Learning

et al. 2020

View full text Add to dashboard Cite

The demand for articial intelligence has grown signicantly over the last decade and this growth has been fueled by advances in machine learning techniques and the ability to leverage hardware acceleration. However, in order to increase the quality of predictions and render machine learning solutions feasible for more complex applications, a substantial amount of training data is required. Although small machine learning models can be trained with modest amounts of data, the input for training larger models such as neural networks grows exponentially with the number of parameters. Since the demand for processing training data has outpaced the increase in computation power of computing machinery, there is a need for distributing the machine learning workload across multiple machines, and turning the centralized into a distributed system. These distributed systems present new challenges, rst and foremost the ecient parallelization of the training process and the creation of a coherent model. This article provides an extensive overview of the current state-of-the-art in the eld by outlining the challenges and opportunities of distributed machine learning over conventional (centralized) machine learning, discussing the techniques used for distributed machine learning, and providing an overview of the systems that are available. 3:3200x over conventional CPUs for an image recognition algorithm using a pretrained multilayer perceptron (MLP).An alternative to generic GPUs for acceleration is the use of Application Specic Integrated Circuits (ASICs) which implement specialized functions through a highly optimized design. In recent times, the demand for such chips has risen signicantly [100]. When applied to e.g. Bitcoin mining, ASICs have a signicant competitive advantage over GPUs and CPUs due to their high performance and power eciency [145]. Since matrix multiplications play a prominent role in many machine learning algorithms, these workloads are highly amenable to acceleration through ASICS. Google applied this concept in their Tensor Processing Unit (TPU) [129], which, as the name suggests, is an ASIC that specializes in calculations on tensors (n-dimensional arrays), and is designed to accelerate their Tensorow [1][2] framework, a popular building block for machine learning models. The most important component of the TPU is its Matrix Multiply unit based on a systolic array. TPUs use a MIMD (Multiple Instructions, Multiple Data) [51] architecture which, unlike GPUs, allows them to execute diverging branches eciently. TPUs are attached to the server system through the PCI Express bus. This provides them with a direct connection with the CPU which allows for a high aggregated bandwidth of 63GB/s (PCI-e5x16). Multiple TPUs can be used in a data center and the individual units can collaborate in a distributed setting. The benet of the TPU over regular CPU/GPU setups is not only its increased processing power but also its power eciency, which is important in large-scale applications due to the cost of energy and the lim...

show abstract

Section: Topologiesmentioning

confidence: 99%

“…Shi and Chu [139] shows Tensorow achieving about 50% eciency on 4-node, InniBandconnected cluster training of ResNet-50He et al [68], and about 75% eciency on GoogleNet [148], showing that the communication overhead plays an important role, and also depends on architecture of the neural network to optimize.…”

Section: Dianne (Distributed Articial Neural Network)mentioning

confidence: 99%

A Survey on Distributed Machine Learning

et al. 2020

View full text Add to dashboard Cite

show abstract

“…Guignard et al [21] presented detailed characterization results of a set of archetypal state-of-the-art DL workloads to identify the performance bottlenecks and to guide the design of prospective acceleration platforms in a more effective manner. Shi et al [36] evaluated the performance of four state-of-the-art distributed DL frameworks over different GPU hardware environments. They built performance models of standard processes in training DNNs with SGD, and then benchmark the performance of the frameworks with three neural networks (i.e., AlexNet, GoogleNet and ResNet-50).…”

Section: Performance Measurement Of Deep Learningmentioning

confidence: 99%

Moving Deep Learning into Web Browser: How Far Can We Go?

Xiang

Zheng

et al. 2019

The World Wide Web Conference

View full text Add to dashboard Cite

Recently, several JavaScript-based deep learning frameworks have emerged, making it possible to perform deep learning tasks directly in browsers. However, little is known on what and how well we can do with these frameworks for deep learning in browsers. To bridge the knowledge gap, in this paper, we conduct the first empirical study of deep learning in browsers. We survey 7 most popular JavaScript-based deep learning frameworks, investigating to what extent deep learning tasks have been supported in browsers so far. Then we measure the performance of different frameworks when running different deep learning tasks. Finally, we dig out the performance gap between deep learning in browsers and on native platforms by comparing the performance of TensorFlow.js and TensorFlow in Python. Our findings could help application developers, deep-learning framework vendors and browser vendors to improve the efficiency of deep learning in browsers.

show abstract

“…Our study aims to understand the training performance of cheap transient servers that have dynamic availability, revocation patterns, and unit costs. In addition, these previous studies often focus on measuring training speed using the average time to process one minibatch [25], [26], [36]. While in this work, we consider multiple important performance metrics-including training time, cost, and accuracy-that could be impacted by training on transient servers.…”

Section: Related Workmentioning

confidence: 99%

Speeding up Deep Learning with Transient Servers

Walls

et al. 2019

2019 IEEE International Conference on Autonomic Computing (ICAC)

View full text Add to dashboard Cite

Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable-e.g., for rapidly evaluating new model designs-they often come with significantly higher monetary costs due to sublinear scalability. In this paper, we investigate the feasibility of using training clusters composed of cheaper transient GPU servers to get the benefits of distributed training without the high costs.We conduct the first large-scale empirical analysis, launching more than a thousand GPU servers of various capacities, aimed at understanding the characteristics of transient GPU servers and their impact on distributed training performance. Our study demonstrates the potential of transient servers with a speedup of 7.7X with more than 62.9% monetary savings for some cluster configurations. We also identify a number of important challenges and opportunities for redesigning distributed training frameworks to be transient-aware. For example, the dynamic cost and availability characteristics of transient servers suggest the need for frameworks to dynamically change cluster configurations to best take advantage of current conditions.

show abstract

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

Cited by 90 publications

References 26 publications

A Survey on Distributed Machine Learning

A Survey on Distributed Machine Learning

Moving Deep Learning into Web Browser: How Far Can We Go?

Speeding up Deep Learning with Transient Servers

Contact Info

Product

Resources

About