A GPU Scheduling Framework to Accelerate Hyper-Parameter Optimization in Deep Learning Clusters

2021
DOI: 10.3390/electronics10030350

Abstract: This paper proposes Hermes, a container-based preemptive GPU scheduling framework for accelerating hyper-parameter optimization in deep learning (DL) clusters. Hermes accelerates hyper-parameter optimization by time-sharing between DL jobs and prioritizing jobs with more promising hyper-parameter combinations. Hermes’s scheduling policy is grounded on the observation that good hyper-parameter combinations converge quickly in the early phases of training. By giving higher priority to fast-converging containers, …
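The abstract describes the policy only at a high level. Below is a minimal, illustrative Python sketch of how a convergence-aware priority scheduler of this kind could rank containerized jobs and decide which containers keep a GPU; the names used here (DLJob, convergence_rate, schedule) are hypothetical and do not come from the Hermes paper.

```python
# Minimal sketch of a convergence-aware priority policy (illustrative only;
# not Hermes's actual implementation). A job's priority is derived from how
# fast its training loss has dropped in the early phase; jobs outside the
# top-k are preempted so higher-priority jobs can use the GPUs.

from dataclasses import dataclass, field
from typing import List


@dataclass
class DLJob:
    job_id: str
    loss_history: List[float] = field(default_factory=list)  # recent training losses
    running: bool = False

    def convergence_rate(self) -> float:
        """Average per-step loss improvement over the observed window."""
        if len(self.loss_history) < 2:
            return 0.0
        return (self.loss_history[0] - self.loss_history[-1]) / (len(self.loss_history) - 1)


def schedule(jobs: List[DLJob], num_gpus: int) -> None:
    """Give GPUs to the fastest-converging jobs; preempt the rest (time-sharing)."""
    ranked = sorted(jobs, key=lambda j: j.convergence_rate(), reverse=True)
    for rank, job in enumerate(ranked):
        should_run = rank < num_gpus
        if should_run and not job.running:
            job.running = True   # resume or start the job's container
        elif not should_run and job.running:
            job.running = False  # preempt the container, freeing its GPU


# Example: two HPO trials sharing one GPU; the faster-converging trial keeps it.
a = DLJob("trial-a", loss_history=[2.0, 1.2, 0.7])
b = DLJob("trial-b", loss_history=[2.0, 1.9, 1.8])
schedule([a, b], num_gpus=1)
print(a.running, b.running)  # True False
```

Ranking by early loss improvement mirrors the paper's observation that good hyper-parameter combinations converge quickly in the early phases of training.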

Cited by 4 publications (4 citation statements). References 13 publications.
“…Jaewon Son et al. 18 presented preliminary research on accelerating hyper-parameter optimization in deep learning (DL) clusters. The paper's main goal was to converge quickly to good hyper-parameter combinations during the initial training phase.…”
Section: Literature Survey
Mentioning confidence: 99%
“…Accuracy efficiency. Hermes [131] is a scheduler to expedite HPO workloads in GPU datacenters. It provides a container preemption mechanism to enable migration between DL jobs with minimal overhead.…”
Section: Hyperparameter Optimization Workloads
Mentioning confidence: 99%
“…Singh and Chana 20 presented an extensive survey of resource scheduling on cloud systems, covering several aspects. While some schedulers deal only with specific resources such as GPUs, [21][22][23] others consider many resources, including CPU, GPU, and memory. There are several resource schedulers for cloud environments and big data systems, such as Kubernetes, 15 Mesos, 18 and Yarn.…”
Section: Related Work
Mentioning confidence: 99%