GPGPU Performance Estimation With Core and Memory Frequency Scaling

Wang, Qiang; Chu, Xiaowen

doi:10.1109/tpds.2020.3004623

Cited by 36 publications

(29 citation statements)

References 43 publications


“…When the core clock frequency is low enough for the application to be compute bound the execution time becomes dominated by computations and there is a direct dependency of the execution time on the core clock frequency. How the device memory latency is hidden and how it changes with the ratio of core and memory clock frequency is described in [40]. This is also supported by the analysis of the performance counters from NVVP which shows that an increase in the execution time at a particular critical frequency is due to the saturation of the number of issued instructions (see Fig.…”

Section: Discussion (mentioning)

confidence: 77%
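The critical-frequency behaviour described in the statement above can be illustrated with a deliberately simple sketch. It assumes kernel time is the larger of a frequency-independent memory time and a compute time that scales as cycles divided by core frequency. This "max" model, the function names, and the numbers are illustrative assumptions, not the model from the cited paper.

```python
def execution_time(f_core_mhz, compute_cycles, memory_time_s):
    """Toy model (assumption): time is the larger of compute and memory time."""
    compute_time_s = compute_cycles / (f_core_mhz * 1e6)
    return max(compute_time_s, memory_time_s)


def critical_frequency_mhz(compute_cycles, memory_time_s):
    """Core frequency below which the toy kernel becomes compute bound."""
    return compute_cycles / memory_time_s / 1e6


# Hypothetical kernel: 2 million core cycles of work, 2 ms of memory time.
cycles = 2.0e6
mem_time = 2.0e-3
f_crit = critical_frequency_mhz(cycles, mem_time)

for f in (1400, 1200, 1000, 800, 500):
    regime = "compute bound" if f < f_crit else "memory bound"
    print(f"{f} MHz: {execution_time(f, cycles, mem_time) * 1e3:.3f} ms ({regime})")
```

In this sketch the execution time stays flat above the critical frequency and grows in inverse proportion to the core clock below it, which is the qualitative shape the statement describes.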

“…The undervolting on GPUs was also explored by Mendes et al [43], where authors have achieved lower energy consumption without performance degradation and in some cases with better performance. Wang and Chu [40] introduce a fine-grained analytical model for estimation of the execution time of different GPU kernels. They have also investigated the memory latency and its dependency on core and memory frequency.…”

Section: Related Work (mentioning)

confidence: 99%


Efficiency Near the Edge: Increasing the Energy Efficiency of FFTs on GPUs for Real-Time Edge Computing

Adámek, Novotný, Thiyagalingam et al. 2021

IEEE Access


The Square Kilometre Array (SKA) is an international initiative for developing the world's largest radio telescope with a total collecting area of over a million square meters. The scale of the operation, combined with the remote location of the telescope, requires the use of energy-efficient computational algorithms. This, along with the extreme data rates that will be produced by the SKA and the requirement for a real-time observing capability, necessitates in-situ data processing in an edge-style computing solution. More generally, energy efficiency is becoming a paramount concern in the modern computing landscape, whether because of the power budget that can limit some of the world's largest supercomputers or the limited power available to the smallest Internet-of-Things devices. In this paper, we study the impact of hardware frequency scaling on the energy consumption and execution time of the Fast Fourier Transform (FFT) on NVIDIA GPUs using the cuFFT library. The FFT is used in many areas of science and it is one of the key algorithms used in radio astronomy data processing pipelines. Through the use of frequency scaling, we show that we can lower the power consumption of the NVIDIA A100 GPU when computing the FFT by up to 47% compared to the boost clock frequency, with less than a 10% increase in the execution time. Furthermore, using one common core clock frequency for all tested FFT lengths, we show on average a 43% reduction in power consumption compared to the boost core clock frequency with an increase in the execution time still below 10%. We demonstrate how these results can be used to lower the power consumption of existing data processing pipelines. These savings, when considered over years of operation, can yield significant financial savings, but can also lead to a significant reduction of greenhouse gas emissions.
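To see what the quoted power and time figures imply for energy, a short worked calculation helps. The sketch below assumes energy is average power multiplied by execution time and treats the abstract's bounds ("up to 47%", "less than 10%") as exact values, so the results are rough estimates rather than figures reported by the authors. The function name is hypothetical.

```python
def relative_energy(power_reduction, time_increase):
    """Energy relative to baseline, assuming energy = average power * time."""
    return (1.0 - power_reduction) * (1.0 + time_increase)


# Best case quoted in the abstract: 47% lower power, 10% longer run time.
best_case = relative_energy(0.47, 0.10)
# Common-frequency case: 43% lower power on average, 10% longer run time.
common_case = relative_energy(0.43, 0.10)

print(f"best case:   {best_case:.3f} of baseline energy")
print(f"common case: {common_case:.3f} of baseline energy")
```

Under these assumptions the energy per FFT falls to roughly 0.58 of the baseline in the best case and roughly 0.63 with a single common core clock, because the power saving outweighs the longer execution time.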


“…By following a similar approach Wang et al [6] proposed a DVFS-aware GPU performance model. The authors estimated the GPU architecture parameters using a collection of microbenchmarks and a group of performance counters, measured during their execution.…”

Section: Related Work (mentioning)

confidence: 99%

“…However, an efficient use of energy management techniques, such as DVFS, requires accurate models that can predict how the energy consumption changes with the GPU operating frequencies (and voltages). This type of modeling is often done by separately modeling the performance and the power consumption of the GPU, focusing on how each one separately scales with DVFS [6], [7]. On the other hand, several previous works have shown that the performance/power behavior of GPU applications considerably vary with the application characteristics [8], [9], which makes these predictive models to require some information from the application to provide accurate predictions.…”

Section: Introduction (mentioning)

confidence: 99%

GPU Static Modeling Using PTX and Deep Structured Learning

et al. 2019


In the quest for exascale computing, energy efficiency is a fundamental goal in high-performance computing systems, typically achieved via dynamic voltage and frequency scaling (DVFS). However, this type of mechanism relies on having accurate methods of predicting the performance and power/energy consumption of such systems. Unlike previous works in the literature, this research focuses on creating novel GPU predictive models that do not require run-time information from the applications. The proposed models, implemented using recurrent neural networks, take into account the sequence of GPU assembly instructions (PTX) and can accurately predict changes in the execution time, power and energy consumption of applications when the frequencies of different GPU domains (core and memory) are scaled. Validated with 24 applications on GPUs from different NVIDIA microarchitectures (Turing, Volta, Pascal and Maxwell), the proposed models attain a significant accuracy. Particularly, the obtained power consumption scaling model provides an average error rate of 7.9% (Tesla T4), 6.7% (Titan V), 5.9% (Titan Xp) and 5.4% (GTX Titan X), which is comparable to state-of-the-art run-time counter-based models. When using the models to select the minimum-energy frequency configuration, significant energy savings can be attained: 8.0% (Tesla T4), 6.0% (Titan V), 29.0% (Titan Xp) and 11.5% (GTX Titan X).
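Selecting a minimum-energy frequency configuration from model predictions, as the abstract mentions, can be sketched as a simple search over candidate settings. The example assumes each (core MHz, memory MHz) pair already has a predicted execution time and average power, and that energy is their product. The frequency pairs, the prediction values, and the function name are all hypothetical.

```python
from typing import Dict, Tuple

Config = Tuple[int, int]  # (core MHz, memory MHz)


def min_energy_config(predictions: Dict[Config, Tuple[float, float]]) -> Config:
    """Return the configuration with the lowest predicted energy (time * power)."""
    return min(predictions, key=lambda c: predictions[c][0] * predictions[c][1])


# Hypothetical predictions: (execution time in seconds, average power in watts).
predictions = {
    (1590, 5000): (1.00, 70.0),
    (1200, 5000): (1.15, 52.0),
    (900, 5000): (1.50, 45.0),
    (1200, 4000): (1.30, 48.0),
}

best = min_energy_config(predictions)
best_time, best_power = predictions[best]
print(f"minimum-energy configuration: {best}, {best_time * best_power:.1f} J")
```

With these made-up numbers the lowest core clock is not the winner: its longer run time costs more energy than its lower power saves, which is why the search must evaluate energy rather than power alone.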


“…Hong and Kim present a simple analytical GPU model to estimate the execution time of GPU kernels, based on estimating the number of parallel memory requests, by considering the number of running threads and memory bandwidth. Wang and Chu provide an improved GPU performance estimation technique that also takes core and memory frequency scaling into account. Unfortunately, their model parameters were determined using microbenchmarks, which have become obsolete for newer generations of GPUs.…”

Section: Related Work (mentioning)

confidence: 99%

Dataflow management, dynamic load balancing, and concurrent processing for real‐time embedded vision applications using Quasar

Goossens

2018

Circuit Theory & Apps


Programming modern embedded vision systems brings various challenges, due to the steep learning curve for programmers and the different characteristics of the devices. Quasar, a new high-level programming language and development environment, considerably simplifies the development. Quasar has a compiler that detects and optimizes parallel programming patterns and a heterogeneous runtime that distributes the computational load over the available compute devices (CPUs and Graphical Processing Unit [GPUs]). In this paper, we focus on runtime aspects of Quasar. We show that with good approximation, the execution time of a GPU kernel function can be factorized in a compile-time-specific component and a runtime-specific component. We show that this approximation leads to a computationally simple runtime load balancing rule. Moreover, the load balancing rule permits efficient implicit concurrency of kernel functions and automatic scaling to multiple compute devices (eg, multi-CPU/GPU systems). Based on an appropriate mathematical scheduling model, we investigate the command queue size trade-off between memory usage and device utilization. The result is a programming environment for embedded vision systems for which automatic parallelization and implicit concurrency detection allow scaling the program efficiently to multi-CPU/GPU systems. Finally, benchmark results are provided to demonstrate the performance of our approach compared with OpenACC and CUDA (Compute Unified Device Architecture).

