2020 IEEE/ACM Fourth Workshop on Deep Learning on Supercomputers (DLS) 2020
DOI: 10.1109/dls51937.2020.00007
Time-Based Roofline for Deep Learning Performance Analysis

Abstract: Deep learning applications are usually very compute-intensive and require long run times for training and inference. This has been tackled by researchers on both the hardware and software sides, and in this paper we propose a Roofline-based approach to performance analysis to facilitate the optimization of these applications. This approach is an extension of the Roofline model widely used in traditional high-performance computing applications, and it incorporates both compute/bandwidth complexity and run time i…
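The time-based idea summarized in the abstract can be illustrated with a minimal sketch (the function name and all numbers below are illustrative assumptions, not values from the paper): a kernel's run time is bounded below by the larger of its compute time and its data-movement time.

```python
def runtime_lower_bound_s(flops, bytes_moved, peak_flops_per_s, bandwidth_bytes_per_s):
    """Roofline-style lower bound on kernel run time (seconds):
    the maximum of the compute-bound time and the memory-bound time."""
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / bandwidth_bytes_per_s
    return max(compute_time, memory_time)

# Hypothetical kernel: 2e12 FLOPs moving 1e10 bytes on a device with
# 1e13 FLOP/s peak compute and 9e11 B/s memory bandwidth.
t = runtime_lower_bound_s(2e12, 1e10, 1e13, 9e11)  # compute-bound: 0.2 s
```

Comparing a measured run time against this bound indicates how far a kernel sits from its hardware ceiling.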

Cited by 14 publications (9 citation statements); references 30 publications (30 reference statements).
“…Ren et al. [48] proposed the first algorithm-hardware co-design framework, combining weight pruning and quantization to reduce the performance overhead caused by irregular sparsity. Wang et al. [61] extended the Roofline model into the deep learning area and incorporated computational complexity and run time into the model, making it possible to analyze code performance for deep learning applications systematically.…”
Section: Dynamic Neural Network
confidence: 99%
“…Most studies using memory profilers are based on a high-level understanding of the individual DNN layers or on analytical models such as the Roofline [30] (Roofline analysis helps visualize the limits imposed by the hardware and determine the main limiting factor, memory bandwidth or computational capacity, thus yielding a roadmap of possible optimization steps [29]). These approaches do not capture the complex interaction between the CPU, memory, and accelerator devices.…”
Section: Introduction
confidence: 99%
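The "main limiting factor" mentioned in the statement above is exactly what the classic Roofline bound identifies; a minimal sketch follows (the peak figures are illustrative assumptions, not measurements of any cited system).

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Classic Roofline: attainable performance (GFLOP/s) at a given
    arithmetic intensity (FLOPs per byte of data moved)."""
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)

# Hypothetical device: 15000 GFLOP/s peak compute, 900 GB/s bandwidth.
# A kernel at 4 FLOPs/byte is memory-bound (4 * 900 = 3600 < 15000).
perf = attainable_gflops(4.0, 15000.0, 900.0)  # 3600.0 GFLOP/s
```

Whichever term of the `min` binds tells the developer whether to optimize data movement or computation first.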
“…To facilitate Roofline studies, a range of tools has emerged: for more accurate machine characterization, such as the Empirical Roofline Toolkit (ERT) [7], [8], and for more streamlined collection of Roofline performance data using open-source tools or workflows [3], [9]-[11]. Beyond tool development, there are also many studies on applying the Roofline model in both traditional HPC [3], [12]-[14] and the emerging field of machine learning [3], [15], [16], as well as extensions and refinements of the model, such as the instruction Roofline [17], Roofline scaling trajectories [18], Roofline-based performance portability [8], and the power and energy Roofline [19], [20].…”
Section: Introduction
confidence: 99%