CUDAAdvisor: LLVM-based runtime profiling for modern GPUs

Shen, Du; Song, Shuaiwen Leon; Li, Ang; Liu, Xu

doi:10.1145/3168831

Cited by 33 publications

(5 citation statements)

References 36 publications


“…This is especially desired when porting traditional CPU-based HPC applications onto the new GPU-based exascale systems, such as Summit [6], Sierra [7] and Perlmutter [37]. As part of the community effort, we are planning to pursue these research directions in our future work with our past experience on GPU analytic modeling [38], [39], [40] and performance optimization [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51].…”

Section: Discussion (mentioning)

confidence: 99%

Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect

Song, Chen, et al. 2020

IEEE Trans. Parallel Distrib. Syst.

Self Cite

153


High performance multi-GPU computing becomes an inevitable trend due to the ever-increasing demand on computation capability in emerging domains such as deep learning, big data and planet-scale simulations. However, the lack of deep understanding on how modern GPUs can be connected and the real impact of state-of-the-art interconnect technology on multi-GPU application performance become a hurdle. In this paper, we fill the gap by conducting a thorough evaluation on five latest types of modern GPU interconnects: PCIe, NVLink-V1, NVLink-V2, NVLink-SLI and NVSwitch, from six high-end servers and HPC platforms: NVIDIA P100-DGX-1, V100-DGX-1, DGX-2, OLCF's SummitDev and Summit supercomputers, as well as an SLI-linked system with two NVIDIA Turing RTX-2080 GPUs. Based on the empirical evaluation, we have observed four new types of GPU communication network NUMA effects: three are triggered by NVLink's topology, connectivity and routing, while one is caused by PCIe chipset design issue. These observations indicate that, for an application running in a multi-GPU node, choosing the right GPU combination can impose considerable impact on GPU communication efficiency, as well as the application's overall performance. Our evaluation can be leveraged in building practical multi-GPU performance models, which are vital for GPU task allocation, scheduling and migration in a shared environment (e.g., AI cloud and HPC centers), as well as communication-oriented performance tuning.


“…In addition, the overhead of obtaining these indicators is usually high. The second approach is based on code instrumentation, such as CUDAAdvisor [55], to measure statistics about the control flow. Although the code instrumentation has little performance effect than the former, it changes the actual behavior of GPU kernels and requires extra effort made by developers and the system administrator in code changes and maintenance.…”

Section: Discussion and Future Work (mentioning)

confidence: 99%

DRLCap: Runtime GPU Frequency Capping with Deep Reinforcement Learning

Wang, Hao, et al. 2024

IEEE Trans. Sustain. Comput.


Power and energy consumption is the limiting factor of modern computing systems. As the GPU becomes a mainstream computing device, power management for GPUs becomes increasingly important. Current works focus on GPU kernel-level power management, with challenges in portability due to architecture-specific considerations. We present DRLCAP, a general runtime power management framework intended to support power management across various GPU architectures. It periodically monitors system-level information to dynamically detect program phase changes and model the workload and GPU system behavior. This elimination from kernel-specific constraints enhances adaptability and responsiveness. The framework leverages dynamic GPU frequency capping, which is the most widely used power knob, to control the power consumption. DRLCAP employs deep reinforcement learning (DRL) to adapt to the changing of program phases by automatically adjusting its power policy through online learning, aiming to reduce the GPU power consumption without significantly compromising the application performance. We evaluate DRLCAP on three NVIDIA and one AMD GPU architectures. Experimental results show that DRLCAP improves prior GPU power optimization strategies by a large margin. On average, it reduces the GPU energy consumption by 22% with less than 3% performance slowdown on NVIDIA GPUs. This translates to a 20% improvement in the energy efficiency measured by the energy-delay product (EDP) over the NVIDIA default GPU power management strategy. For the AMD GPU architecture, DRLCAP saves energy consumption by 10%, on average, with a 4% percentage loss, and improves energy efficiency by 8%.
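The relation between the energy, slowdown and EDP figures in this abstract can be checked with a short calculation. The sketch below assumes the usual definition of the energy-delay product (energy multiplied by execution time) and a baseline normalized to 1.0; the function names are illustrative and not taken from DRLCAP. Because the abstract reports averages over benchmarks, multiplying the averaged figures only approximates the reported EDP improvement.

```python
def energy_delay_product(energy, delay):
    """EDP under the common definition: energy multiplied by execution time."""
    return energy * delay


def relative_edp_reduction(energy_saving, slowdown):
    """Fractional EDP reduction versus a baseline with energy = delay = 1.0.

    energy_saving and slowdown are fractions, e.g. 0.22 for 22%.
    """
    new_edp = energy_delay_product(1.0 - energy_saving, 1.0 + slowdown)
    return 1.0 - new_edp


# NVIDIA figures from the abstract: 22% energy saving, slowdown bounded by 3%.
nvidia = relative_edp_reduction(0.22, 0.03)
print(f"Approximate EDP reduction: {nvidia:.1%}")
```

With the 3% slowdown bound, the result is roughly 0.20, in line with the 20% EDP improvement the abstract reports for NVIDIA GPUs.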


“…Based on the stall analysis, it identifies inefficient software-hardware interactions and their root causes, thus helping make informed optimization decisions. CUDAAdvisor [37], built on top of LLVM, instrumentalizes application code on both the host and device sides. It conducts code-and data-centric profiling to identify performance bottlenecks arising from competition for cache resources and memory and control flow divergence.…”

Section: Related Work (mentioning)

confidence: 99%

Analyzing GPU Performance in Virtualized Environments: A Case Study

Belkhiri, Dagenais 2024

Future Internet


The graphics processing unit (GPU) plays a crucial role in boosting application performance and enhancing computational tasks. Thanks to its parallel architecture and energy efficiency, the GPU has become essential in many computing scenarios. On the other hand, the advent of GPU virtualization has been a significant breakthrough, as it provides scalable and adaptable GPU resources for virtual machines. However, this technology faces challenges in debugging and analyzing the performance of GPU-accelerated applications. Most current performance tools do not support virtual GPUs (vGPUs), highlighting the need for more advanced tools. Thus, this article introduces a novel performance analysis tool that is designed for systems using vGPUs. Our tool is compatible with the Intel GVT-g virtualization solution, although its underlying principles can apply to many vGPU-based systems. Our tool uses software tracing techniques to gather detailed runtime data and generate relevant performance metrics. It also offers many synchronized graphical views, which gives practitioners deep insights into GVT-g operations and helps them identify potential performance bottlenecks in vGPU-enabled virtual machines.
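The citation statement for this entry names control flow divergence as one of the bottlenecks that CUDAAdvisor profiles. As a rough illustration of what that metric means, the sketch below counts warps whose threads disagree on a branch outcome. It is a host-side model in Python, not CUDAAdvisor's instrumentation; the warp size of 32 and the function name `divergent_warps` are assumptions made for the example.

```python
WARP_SIZE = 32  # assumed warp width, as on current NVIDIA GPUs


def divergent_warps(branch_taken):
    """Count warps in which threads disagree on a branch outcome.

    branch_taken: list of booleans, one per thread, ordered by thread id.
    Returns a tuple (divergent_warp_count, total_warp_count).
    """
    total = 0
    divergent = 0
    for start in range(0, len(branch_taken), WARP_SIZE):
        warp = branch_taken[start:start + WARP_SIZE]
        total += 1
        # A warp diverges when some, but not all, of its threads take the branch.
        if any(warp) and not all(warp):
            divergent += 1
    return divergent, total


# Branching on thread parity splits every warp.
even_odd = [tid % 2 == 0 for tid in range(128)]
# Branching on the warp index keeps each warp uniform.
per_warp = [(tid // WARP_SIZE) % 2 == 0 for tid in range(128)]

print(divergent_warps(even_odd))
print(divergent_warps(per_warp))
```

The two inputs show why the branch condition matters: a condition that varies within a warp serializes both paths, while one that varies only across warps does not.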
