2016 45th International Conference on Parallel Processing (ICPP)
DOI: 10.1109/icpp.2016.15

Performance Analysis of GPU-Based Convolutional Neural Networks

Cited by 105 publications (43 citation statements) | References 6 publications
“…Table (I) contains a summary of all the convolutional parameters described so far. One of the challenges with convolutions is that they are computationally intensive operations, taking up 86% to 94% of execution time for CNNs [1]. For heavy workloads, convolutions are typically run on graphical processing units (GPUs), as they are able to perform many mathematical operations in parallel.…”
Section: II.1 Convolutions Background (mentioning, confidence: 99%)
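The 86% to 94% figure quoted above is easiest to see from the loop nest of a direct convolution: every output element costs C·K·K multiply-accumulates. The sketch below is illustrative NumPy, not code from the cited paper, and the layer shape used for the cost estimate is an assumed example.

```python
# Illustrative sketch (not from the cited paper): a naive direct convolution.
# The nested loops make the arithmetic cost explicit: every output element
# needs C * K * K multiply-accumulates, which is why convolution dominates
# CNN runtime and is usually offloaded to GPUs.
import numpy as np

def direct_conv2d(x, w):
    """x: (C, H, W) input, w: (F, C, K, K) filters; stride 1, no padding."""
    C, H, W = x.shape
    F, _, K, _ = w.shape
    out_h, out_w = H - K + 1, W - K + 1
    y = np.zeros((F, out_h, out_w), dtype=x.dtype)
    for f in range(F):                      # each output feature map
        for i in range(out_h):
            for j in range(out_w):
                # C*K*K multiply-accumulates per output element
                y[f, i, j] = np.sum(x[:, i:i+K, j:j+K] * w[f])
    return y

# Cost estimate for an assumed layer shape (128 channels, 64 filters, 3x3 kernel):
C, H, W, F, K = 128, 56, 56, 64, 3
macs = F * (H - K + 1) * (W - K + 1) * C * K * K
print(f"~{macs/1e9:.2f} GMACs for one forward pass of this layer")
```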
“…One of the primary bottlenecks is computing the matrix multiplication required for forward propagation. In fact, over 80% of the total processing time is spent on the convolution [1]. Therefore, techniques that improve the efficiency of even forward-only propagation are in high demand and researched extensively [2,3].…”
Section: Introduction (mentioning, confidence: 99%)
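The matrix multiplication referred to in this statement is typically obtained by lowering the convolution to a single GEMM via an im2col transform, which is how many GPU libraries realize forward propagation. The following is a hedged NumPy sketch of that lowering; the function and variable names are ours, not taken from the cited works.

```python
# Hedged sketch of the im2col + GEMM lowering commonly used for forward
# propagation (names and shapes are illustrative, not the papers' code).
import numpy as np

def im2col(x, K):
    """Unfold a (C, H, W) input into a (C*K*K, out_h*out_w) matrix of patches."""
    C, H, W = x.shape
    out_h, out_w = H - K + 1, W - K + 1
    cols = np.empty((C * K * K, out_h * out_w), dtype=x.dtype)
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, idx] = x[:, i:i+K, j:j+K].ravel()
            idx += 1
    return cols

def conv2d_gemm(x, w):
    """Convolution as one matrix multiplication: (F, C*K*K) @ (C*K*K, P)."""
    F, C, K, _ = w.shape
    cols = im2col(x, K)
    out = w.reshape(F, -1) @ cols            # the GEMM that dominates runtime
    out_h = x.shape[1] - K + 1
    return out.reshape(F, out_h, -1)

# Shape check on random data:
x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
print(conv2d_gemm(x, w).shape)               # (4, 6, 6)
```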
“…Nvprof provides us with information related to the type of kernels running on the GPU, GPU utilization and other metrics. Our work differs from prior works that have used GPU based profiling tools such as nvprof to analyse the performance of ConvNets [47] or existing performance benchmark on desktop GPUs [2], where we restrict our studies to fine-grained energy and performance measurements on the CPUs.…”
Section: Power Sampling Methods (mentioning, confidence: 99%)
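For context, the kind of nvprof run described above collects a per-kernel GPU trace that can then be filtered for convolution kernels. The sketch below is a hypothetical wrapper around such a run; the workload script and output path are placeholders, not taken from the cited works.

```python
# Hypothetical sketch of nvprof-style kernel profiling (paths and the workload
# script are placeholders, not from the cited works).
import subprocess

LOG = "gpu_trace.csv"                        # assumed output location
cmd = [
    "nvprof",
    "--print-gpu-trace",                     # one line per kernel launch / memcpy
    "--csv", "--log-file", LOG,
    "python", "train_convnet.py",            # hypothetical ConvNet workload
]
subprocess.run(cmd, check=True)

# Rough post-processing: count how many recorded GPU activities look like
# convolution kernels (cuDNN kernel names typically contain "conv").
with open(LOG) as f:
    lines = f.readlines()
conv_launches = sum(1 for line in lines if "conv" in line.lower())
print(f"{conv_launches} convolution-related entries out of {len(lines)} trace lines")
```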
“…There are several approaches to compute the Convolution operation [6][7][8][9][10][11][12]. Fast Fourier transformation (FFT), Winograd minimal filtering algorithm, the look-up table and matrix multiplication-based convolution are a few of them.…”
Section: Introduction (mentioning, confidence: 99%)
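Of the approaches listed, FFT-based convolution replaces the sliding-window sum with element-wise products in the frequency domain. Below is a minimal single-channel NumPy sketch, illustrative only, cross-checked against a direct sliding-window correlation.

```python
# Minimal single-channel sketch of FFT-based convolution (illustrative only).
import numpy as np

def fft_conv2d(x, k):
    """Valid cross-correlation of a (H, W) image with a (K, K) filter via FFT."""
    H, W = x.shape
    K = k.shape[0]
    fh, fw = H + K - 1, W + K - 1            # linear-convolution size
    # FFT multiplication computes true convolution (flipped kernel); flip the
    # kernel first so the result matches CNN-style cross-correlation.
    kf = np.fft.rfft2(k[::-1, ::-1], s=(fh, fw))
    xf = np.fft.rfft2(x, s=(fh, fw))
    full = np.fft.irfft2(xf * kf, s=(fh, fw))
    return full[K-1:H, K-1:W]                # crop to the "valid" output

x = np.random.rand(32, 32)
k = np.random.rand(5, 5)
# Cross-check against a direct sliding-window correlation:
ref = np.array([[np.sum(x[i:i+5, j:j+5] * k) for j in range(28)] for i in range(28)])
print(np.allclose(fft_conv2d(x, k), ref))    # True (up to floating-point error)
```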
“…This algorithm reduces the arithmetic complexity of the convolutional layer by using a minimal filtering technique. These approaches to compute the convolution can further be optimized by using different techniques and schemes [12][13][14].…”
Section: Introduction (mentioning, confidence: 99%)
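The minimal filtering technique this statement refers to is Winograd's algorithm; its smallest instance, F(2,3), produces two outputs of a 3-tap filter using four multiplications instead of six. A hedged one-dimensional sketch, illustrative of the technique rather than any cited implementation:

```python
# Hedged sketch of Winograd's minimal filtering algorithm F(2,3): two outputs
# of a 3-tap correlation from four inputs using 4 multiplications instead of 6.
import numpy as np

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs of the valid correlation."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.random.rand(4)
g = np.random.rand(3)
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
print(np.allclose(winograd_f23(d, g), direct))   # True
```

Nesting the same transform in two dimensions gives F(2x2, 3x3), which needs 16 multiplications per output tile instead of 36, and in practice the filter transform is precomputed once per layer.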