2022
DOI: 10.1109/tie.2021.3095790

Cooperative Distributed GPU Power Capping for Deep Learning Clusters

Abstract: The recent GPU-based clusters that handle deep learning (DL) tasks have the features of GPU device heterogeneity, a variety of deep neural network (DNN) models, and high computational complexity. Thus, the traditional power capping methods for CPU-based clusters or small-scale GPU devices do not apply to the GPU-based clusters handling DL tasks. This paper develops a cooperative distributed GPU power capping (CD-GPC) system for GPU-based clusters, aiming to minimize the training completion time of invoked DL tasks.
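The abstract refers to capping the power of individual GPU devices in a cluster. As a point of reference, below is a minimal sketch of how a per-GPU power limit can be applied through NVML using the pynvml bindings. The set_power_cap helper is hypothetical and illustrates only the low-level actuation mechanism; it is not the CD-GPC controller described in the paper.

```python
import pynvml

def set_power_cap(gpu_index: int, cap_watts: float) -> None:
    """Apply a power limit (in watts) to one GPU via NVML.

    Hypothetical helper for illustration; requires administrative privileges
    and a driver that supports power management on the target device.
    """
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        # NVML reports and accepts limits in milliwatts.
        min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
        cap_mw = int(cap_watts * 1000)
        # Clamp the requested cap to the range the device actually supports.
        cap_mw = max(min_mw, min(max_mw, cap_mw))
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, cap_mw)
    finally:
        pynvml.nvmlShutdown()
```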

Cited by 7 publications (3 citation statements)
References 25 publications (43 reference statements)
“…The continuous advancement of GPU architectures enhances the speed of DNN model training but also results in significantly higher energy consumption. Despite improvements in manufacturing processes, GPU devices continue to exhibit high absolute energy usage [18]. It is worth noting that even with the significant increase in energy usage, we may only observe marginal improvements in DNN model training performance, which depend on the specific DNN model types and characteristics of worker nodes.…”
Section: Introduction
confidence: 83%
“…Now, we present a GPU core frequency-based performance model for DL jobs, utilizing a statistical modeling approach. This model is grounded in the relationship t ∝ 1/f (processing time is inversely proportional to core frequency), where t represents the DL job processing time and f denotes the frequency value, as discussed in [17, 18]. In this model, λ_F_i and λ_B_i are the performance model coefficients for feed-forward and back-propagation processes in DNN model training jobs, respectively.…”
Section: Deep Learning Job Model
confidence: 99%
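The statement above describes a frequency-based performance model with t ∝ 1/f and per-job coefficients for the feed-forward and back-propagation phases. Below is a minimal sketch of fitting such a model from sampled (frequency, time) measurements via least squares; the variable names and the sample values are illustrative placeholders, not taken from the cited paper.

```python
import numpy as np

# Synthetic example samples: GPU core frequencies (MHz) and measured
# per-iteration training times (s) for one DL job. Illustrative only.
freqs = np.array([1110.0, 1230.0, 1350.0, 1470.0, 1590.0])
times = np.array([0.172, 0.156, 0.143, 0.132, 0.122])

# Fit t ≈ lambda_total / f, where lambda_total plays the role of
# (lambda_F + lambda_B) in the model described above.
X = (1.0 / freqs).reshape(-1, 1)            # regressor: 1/f
lambda_total, *_ = np.linalg.lstsq(X, times, rcond=None)

def predicted_time(f_mhz: float) -> float:
    """Predicted per-iteration training time at core frequency f (MHz)."""
    return float(lambda_total[0] / f_mhz)

print(predicted_time(1400.0))
```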
“…4.1.2 Model selection. According to some previous works [1, 17], we adopt some popular and classical DNN models as the evaluation models, including AlexNet, VGG11, VGG16, ResNet18, ResNet50, and DenseNet121.…”
Section: Output
confidence: 99%
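The evaluation models named in the statement above are all available in torchvision. Below is a minimal sketch of one way to instantiate them; this is an illustration of a possible setup, not the cited authors' actual experimental harness.

```python
import torchvision.models as models

# Instantiate the evaluation models listed in the citation statement.
evaluation_models = {
    "AlexNet": models.alexnet(),
    "VGG11": models.vgg11(),
    "VGG16": models.vgg16(),
    "ResNet18": models.resnet18(),
    "ResNet50": models.resnet50(),
    "DenseNet121": models.densenet121(),
}
```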