Deep Learning-based Job Placement in Distributed Machine Learning Clusters

Bao, Yixin; Peng, Yanghua; Wu, Chuan

doi:10.1109/infocom.2019.8737460

Cited by 105 publications

(47 citation statements)

References 21 publications

“…Gao et al [14] solve a training time minimization problem to find the best device placement of a deep neural network, using a reinforcement learning algorithm. Bao et al [7] propose a deep learning-based job placement algorithm to minimize interference among co-located ML jobs. Resource allocation among multiple jobs is not considered by these work.…”

Section: Related Work (mentioning)

confidence: 99%

“…subject to: This maximization problem involves integer variables, non-linear constraint (2b) (2c) and constraints concerning multiplication of variables (2f)(2h)(7b). To address these challenges, we first apply the compact-exponential techniques [36] to reformulate problem (7) into an equivalent conventional integer linear program (ILP) with packing structure: (8) subject to:…”

Section: The Maximum Weighted Schedule Problem (mentioning)

confidence: 99%

Online scheduling of heterogeneous distributed machine learning jobs

Zhang, Zhou, et al. 2020

Proceedings of the Twenty-First International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Netw

Self Cite

Distributed machine learning (ML) has played a key role in today's proliferation of AI services. A typical model of distributed ML is to partition training datasets over multiple worker nodes to update model parameters in parallel, adopting a parameter server architecture. ML training jobs are typically resource elastic, completed using various time lengths with different resource configurations. A fundamental problem in a distributed ML cluster is how to explore the demand elasticity of ML jobs and schedule them with different resource configurations, such that the utilization of resources is maximized and average job completion time is minimized. To address it, we propose an online scheduling algorithm to decide the execution time window, the number and the type of concurrent workers and parameter servers for each job upon its arrival, with a goal of minimizing the weighted average completion time. Our online algorithm consists of (i) an online scheduling framework that groups unprocessed ML training jobs into a batch iteratively, and (ii) a batch scheduling algorithm that configures each ML job to maximize the total weight of scheduled jobs in the current iteration. Our online algorithm guarantees a good parameterized competitive ratio with polynomial time complexity. Extensive evaluations using real-world data demonstrate that it outperforms state-of-the-art schedulers in today's AI cloud systems.

“…The recent advance of RL has expedited automation of system operations in many areas. They include energy optimization in data centers [38], [39], cluster resource management [5]-[7], job placement in cloud networks [40], network slicing [41], and compiler optimization [42]. In this paper, we adopt RL for the GFPS problem, which is considered challenging in the area of real-time systems.…”

Section: Name Description (mentioning)

confidence: 99%

Panda: Reinforcement Learning-Based Priority Assignment for Multi-Processor Real-Time Scheduling

Lee, Yeom, et al. 2020

IEEE Access

Recently, deep reinforcement learning (RL) technologies have been considered as a feasible solution for tackling combinatorial problems in various research and engineering areas. Motivated by this recent success of RL-based approaches, in this paper, we focus on how to utilize RL technologies in the context of real-time system research. Specifically, we first formulate the problem of fixed-priority assignments for multi-processor real-time scheduling, which has long been considered challenging in the real-time system community, as a combinatorial problem. We then propose the RL-based priority assignment model Panda that employs (i) a taskset embedding mechanism driven by attention-based encoder-decoder deep neural networks, hence enabling to efficiently extract useful features from the dynamic relation of periodic tasks. We also present two optimization schemes tailored to adopt RL for real-time task scheduling problems: (ii) the response time analysis (RTA)-based policy gradient RL and guided learning schemes, which facilitate the training processes of the Panda model. To the best of our knowledge, our approach is the first to employ RL for real-time task scheduling. Through various experiments, we show that Panda is competitive with well-known heuristic algorithms for real-time task scheduling upon a multi-processor platform, and it often outperforms them in large-scale nontrivial settings, e.g., achieving an average 7.7% enhancement in schedulability ratio for a testing system configuration of 64-sized tasksets and an 8-processor platform.

“…In output layer of the policy NN, we mask invalid actions, which points to a direction of obstacles within one meter from the walker, by setting their probability to 0 in the probability distribution. Then we re-scale the probabilities of all actions such that the sum still equals 1 (Bao et al, 2019). The walker will then move one meter toward the chosen direction.…”

Section: Action Space (mentioning)

confidence: 99%
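
The action-masking step described in the citation statement above (set the probability of invalid actions to 0, then re-scale the rest so they sum to 1) can be sketched as follows. This is a minimal illustration only, not code from either paper: the function name `mask_and_renormalize`, the use of NumPy, and the example probabilities are all assumptions made here.

```python
import numpy as np

def mask_and_renormalize(probs, valid):
    """Zero out invalid actions and re-scale the remaining probabilities.

    probs: action probabilities from the policy output layer (hypothetical input).
    valid: boolean flags, True where the action is allowed.
    """
    probs = np.asarray(probs, dtype=float)
    valid = np.asarray(valid, dtype=bool)

    # Set the probability of every invalid action to 0.
    masked = np.where(valid, probs, 0.0)

    # Re-scale so the remaining probabilities sum to 1 again.
    total = masked.sum()
    if total <= 0.0:
        raise ValueError("no valid action has non-zero probability")
    return masked / total

# Example: the second of four actions is invalid (e.g. blocked by an obstacle).
policy_output = [0.1, 0.2, 0.3, 0.4]
is_valid = [True, False, True, True]
masked_policy = mask_and_renormalize(policy_output, is_valid)
```

An action can then be sampled from `masked_policy` as usual; because invalid entries are exactly 0, they are never chosen.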

A Smart Robotic Walker With Intelligent Close-Proximity Interaction Capabilities for Elderly Mobility Safety

et al. 2020

Self Cite

The elderly population has rapidly increased in past years, bringing huge demands for elderly serving devices, especially for those with mobility impairment. Present assistant walkers designed for elderly users are primitive with limited user interactivity and intelligence. We propose a novel smart robotic walker that targets a convenient-to-use indoor walking aid for the elderly. The walker supports multiple modes of interactions through voice, gait or haptic touch, and allows intelligent control via learning-based methods to achieve mobility safety. Our design enables a flexible, initiative and reliable walker due to the following: (1) we take a hybrid approach by combining the conventional mobile robotic platform with the existing rollator design, to achieve a novel robotic system that fulfills expected functionalities; (2) our walker tracks users in front by detecting lower limb gait, while providing close-proximity walking safety support; (3) our walker can detect human intentions and predict emergency events, e.g., falling, by monitoring force pressure on a specially designed soft-robotic interface on the handle; (4) our walker performs reinforcement learning-based sound source localization to locate and navigate to the user based on his/her voice signals. Experiment results demonstrate the sturdy mechanical structure, the reliability of multiple novel interactions, and the efficiency of the intelligent control algorithms implemented. The demonstration video is available at: https://sites.google.com/view/smart-walker-hku.
