Proceedings of the Machine Learning on HPC Environments 2017
DOI: 10.1145/3146347.3146356
An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures

Cited by 56 publications (23 citation statements) · References 3 publications
“…The individual benchmarks are described below: Figure 6 and Figure 7 show the complete source code. The matrix multiplication and convolution kernels were selected because they dominate the training and inference time of the most common classical networks [4,75]. The other kernels bring interesting computation patterns, enabling expressiveness and performance comparisons across more diverse network architectures.…”

Section: Performance Results
Confidence: 99%
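The excerpt above treats matrix multiplication and convolution as the kernels that dominate DNN training and inference time. As an illustration only (not the benchmark code from the cited paper, whose actual source appears in its Figures 6 and 7), a minimal Python/NumPy sketch of timing the two kernels might look like the following; the array shapes, repetition count, and the use of SciPy's convolve2d are assumptions made here for brevity.

```python
import time
import numpy as np
from scipy.signal import convolve2d  # one of many possible conv routines

def time_kernel(fn, *args, reps=10):
    """Run fn(*args) reps times and return mean wall-clock seconds."""
    fn(*args)  # warm-up call, excluded from the measurement
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(*args)
    return (time.perf_counter() - t0) / reps

# Matrix multiplication: the core operation of fully connected layers.
A = np.random.rand(1024, 1024).astype(np.float32)
B = np.random.rand(1024, 1024).astype(np.float32)
print("matmul:", time_kernel(np.matmul, A, B), "s")

# 2-D convolution: the core operation of convolutional layers.
img = np.random.rand(512, 512).astype(np.float32)
kern = np.random.rand(3, 3).astype(np.float32)
print("conv2d:", time_kernel(convolve2d, img, kern), "s")
```

Comparing the two timings on CPU and GPU backends is the essence of the kernel-level characterization the excerpt refers to.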
“…Several existing research efforts have shown the impact of different hardware platforms on the performance of DL frameworks [8], [10], [11], [26], and have compared the performance of different DL frameworks with respect to their DNN structures and default configuration settings [9], [27]. Thus, in this paper, we conduct an empirical measurement study characterizing and analyzing DL frameworks in terms of how they respond to different hyper-parameter configurations, different types of datasets, and different choices of parallel computing libraries.…”

Section: Methodology and Baselines
Confidence: 99%
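The study described above sweeps hyper-parameters (e.g., batch size) and parallel computing libraries (e.g., the BLAS thread pool backing a framework). A minimal sketch of such a sweep, assuming the real threadpoolctl package to cap BLAS threads and using a single matmul as a hypothetical stand-in for one dense-layer forward pass, might be:

```python
import time
import numpy as np
from threadpoolctl import threadpool_limits  # caps threads in BLAS/OpenMP pools

def forward_pass(batch, weights):
    """Hypothetical stand-in for a dense layer: one matmul per forward pass."""
    return batch @ weights

weights = np.random.rand(4096, 4096).astype(np.float32)

for batch_size in (32, 64, 128):        # hyper-parameter sweep
    batch = np.random.rand(batch_size, 4096).astype(np.float32)
    for n_threads in (1, 2, 4):         # parallel-library sweep
        with threadpool_limits(limits=n_threads):
            t0 = time.perf_counter()
            for _ in range(20):
                forward_pass(batch, weights)
            dt = (time.perf_counter() - t0) / 20
        print(f"batch={batch_size:4d} threads={n_threads} {dt * 1e3:7.2f} ms")
```

A real study of this kind would swap in full framework training loops and real datasets; the grid structure of the measurement, however, stays the same.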
“…It is widely recognized that choosing the right DL framework for the right application is a daunting task for many researchers, developers, and domain scientists. Although there are some existing DL benchmarking efforts, most of them have centered on studying different CPU-GPU configurations and their impact on different DL frameworks with standard datasets [8], [9], [10], [11]. Even under the same CPU-GPU configuration, no single DL framework dominates in performance and accuracy on standard datasets such as MNIST [12], CIFAR [13], and ImageNet [14].…”

Section: Introduction
Confidence: 99%
“…As a consequence, intelligent services mostly run on remote servers rather than directly on the devices where the applications execute. At the same time, the data required to train the models are collected centrally, which in turn facilitates the training of the models used by the intelligent services [6].…”

Section: Intelligent Services Decoupling
Confidence: 99%