2022
DOI: 10.1109/tie.2021.3095790

Cooperative Distributed GPU Power Capping for Deep Learning Clusters

Abstract: The recent GPU-based clusters that handle deep learning (DL) tasks have the features of GPU device heterogeneity, a variety of deep neural network (DNN) models, and high computational complexity. Thus, the traditional power capping methods for CPU-based clusters or small-scale GPU devices do not apply to the GPU-based clusters handling DL tasks. This paper develops a cooperative distributed GPU power capping (CD-GPC) system for GPU-based clusters, aiming to minimize the training completion time of invoked DL tasks.
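The abstract refers to capping the power of individual GPU devices in a cluster. As a point of reference, below is a minimal sketch of how a per-GPU power limit can be applied through NVML using the pynvml bindings. The set_power_cap helper is hypothetical and illustrates only the low-level actuation mechanism; it is not the CD-GPC controller described in the paper.

```python
import pynvml

def set_power_cap(gpu_index: int, cap_watts: float) -> None:
    """Apply a power limit (in watts) to one GPU via NVML.

    Hypothetical helper for illustration; requires administrative privileges
    and a driver that supports power management on the target device.
    """
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        # NVML reports and accepts limits in milliwatts.
        min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
        cap_mw = int(cap_watts * 1000)
        # Clamp the requested cap to the range the device actually supports.
        cap_mw = max(min_mw, min(max_mw, cap_mw))
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, cap_mw)
    finally:
        pynvml.nvmlShutdown()
```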

Cited by 7 publications (3 citation statements)
References 25 publications (43 reference statements)
“…The continuous advancement of GPU architectures enhances the speed of DNN model training but also results in significantly higher energy consumption. Despite improvements in manufacturing processes, GPU devices continue to exhibit high absolute energy usage [18]. It is worth noting that even with the significant increase in energy usage, we may only observe marginal improvements in DNN model training performance, which depend on the specific DNN model types and characteristics of worker nodes.…”
Section: Introduction
confidence: 83%
“…Now, we present a GPU core frequency-based performance model for DL jobs, utilizing a statistical modeling approach. This model is grounded in the relationship t ∝ 1/f (processing time is inversely proportional to core frequency), where t represents the DL job processing time and f denotes the frequency value, as discussed in [17, 18]. In this model, λ_F_i and λ_B_i are the performance model coefficients for feed-forward and back-propagation processes in DNN model training jobs, respectively.…”
Section: Deep Learning Job Model
confidence: 99%
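The statement above describes a frequency-based performance model with t ∝ 1/f and per-job coefficients for the feed-forward and back-propagation phases. Below is a minimal sketch of fitting such a model from sampled (frequency, time) measurements via least squares; the variable names and the sample values are illustrative placeholders, not taken from the cited paper.

```python
import numpy as np

# Synthetic example samples: GPU core frequencies (MHz) and measured
# per-iteration training times (s) for one DL job. Illustrative only.
freqs = np.array([1110.0, 1230.0, 1350.0, 1470.0, 1590.0])
times = np.array([0.172, 0.156, 0.143, 0.132, 0.122])

# Fit t ≈ lambda_total / f, where lambda_total plays the role of
# (lambda_F + lambda_B) in the model described above.
X = (1.0 / freqs).reshape(-1, 1)            # regressor: 1/f
lambda_total, *_ = np.linalg.lstsq(X, times, rcond=None)

def predicted_time(f_mhz: float) -> float:
    """Predicted per-iteration training time at core frequency f (MHz)."""
    return float(lambda_total[0] / f_mhz)

print(predicted_time(1400.0))
```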
“…4.1.2 Model selection. According to some previous works [1, 17], we adopt some popular and classical DNN models as the evaluation models, including AlexNet, VGG11, VGG16, ResNet18, ResNet50, and DenseNet121.…”
Section: Output
confidence: 99%
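The evaluation models named in the statement above are all available in torchvision. Below is a minimal sketch of one way to instantiate them; this is an illustration of a possible setup, not the cited authors' actual experimental harness.

```python
import torchvision.models as models

# Instantiate the evaluation models listed in the citation statement.
evaluation_models = {
    "AlexNet": models.alexnet(),
    "VGG11": models.vgg11(),
    "VGG16": models.vgg16(),
    "ResNet18": models.resnet18(),
    "ResNet50": models.resnet50(),
    "DenseNet121": models.densenet121(),
}
```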