Performance Modelling of Deep Learning on Intel Many Integrated Core Architectures

Viebke, André; Pllana, Sabri; Memeti, Suejb; Kołodziej, Joanna

doi:10.1109/hpcs48598.2019.9188090

Cited by 3 publications

(3 citation statements)

References 23 publications

“…DNN performance models for different hardware. There exists prior work on performance models for DNN training on both GPUs [35,74,75] and CPUs [86], though only the works by Qi et al and Justus et al seem to support generic DNNs. As described above, Surfer is fundamentally different from these works because it takes a hybrid runtime-based approach when making execution time predictions.…”

Section: Related Work (mentioning)

confidence: 99%

A Runtime-Based Computational Performance Predictor for Deep Neural Network Training

Yu, Gao, Golikov et al. 2021

Preprint

Deep learning researchers and practitioners usually leverage GPUs to help train their deep neural networks (DNNs) faster. However, choosing which GPU to use is challenging both because (i) there are many options, and (ii) users grapple with competing concerns: maximizing compute performance while minimizing costs. In this work, we present a new practical technique to help users make informed and cost-efficient GPU selections: make performance predictions using the help of a GPU that the user already has. Our technique exploits the observation that, because DNN training consists of repetitive compute steps, predicting the execution time of a single iteration is usually enough to characterize the performance of an entire training process. We make predictions by scaling the execution time of each operation in a training iteration from one GPU to another using either (i) wave scaling, a technique based on a GPU's execution model, or (ii) pre-trained multilayer perceptrons. We implement our technique into a Python library called Surfer and find that it makes accurate iteration execution time predictions on ResNet-50, Inception v3, the Transformer, GNMT, and DCGAN across six different GPU architectures. Surfer currently supports PyTorch, is easy to use, and requires only a few lines of code.

“…Andre Viebke [26] investigated performance prediction accuracy using three alternative CNN models on an Intel Xeon Phi Processor. These two parameterized performance models estimated training convolutional neural networks' execution time.…”

Section: Related Work (mentioning)

confidence: 99%

Performance Analysis of Distributed Deep Learning Frameworks in a Multi-GPU Environment

Tulasi, Han, Lloyd et al. 2021

2021 20th International Conference on Ubiquitous Computing and Communications (IUCC/CIT/DSCI/SmartCNS)

Deep Learning frameworks, such as TensorFlow, MXNet, Chainer, provide many basic building blocks for designing effective neural network models for various applications (e.g. computer vision, speech recognition, natural language processing). However, run-time performance of these deep learning frameworks varies significantly even when training identical deep network models on the same GPUs. This study presents an experimental analysis and performance model for assessing deep learning models (Convolutional Neural Networks (CNNs), Multilayer Perceptrons (MLP), Autoencoder) on three frameworks: TensorFlow, MXNet, and Chainer, in a multi-GPU environment. We analyse factors that influence these frameworks' performance by computing the running time of each framework in our proposed model, taking load imbalance factor into account. The evaluation results highlight significant differences in the scalability of the frameworks, and the importance of load balance in parallel distributed deep learning.

“…layer fusion), they only serve as a lower-bound approximation of a layer's real-world performance. Recent benchmark suites take a multi-tier approach [8,30,53], whereby they provide a collection of benchmarks that cover both end-to-end model and layer benchmarking.…”

Section: Background and Related Work (mentioning)

confidence: 99%

XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs

Li, Dakkak, Xiong et al. 2019

Preprint

The world sees a proliferation of machine learning/deep learning (ML) models and their wide adoption in different application domains recently. This has made the profiling and characterization of ML models an increasingly pressing task for both hardware designers and system providers, as they would like to offer the best possible computing system to serve ML models with the desired latency, throughput, and energy requirements while maximizing resource utilization. Such an endeavor is challenging as the characteristics of an ML model depend on the interplay between the model, framework, system libraries, and the hardware (or the HW/SW stack). A thorough characterization requires understanding the behavior of the model execution across the HW/SW stack levels. Existing profiling tools are disjoint, however, and only focus on profiling within a particular level of the stack. This paper proposes a leveled profiling design that leverages existing profiling tools to perform across-stack profiling. The design does so in spite of the profiling overheads incurred from the profiling providers. We coupled the profiling capability with an automatic analysis pipeline to systematically characterize 65 state-of-the-art ML models. Through this characterization, we show that our across-stack profiling solution provides insights (which are difficult to discern otherwise) on the characteristics of ML models, ML frameworks, and GPU hardware.
