2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid) 2022
DOI: 10.1109/ccgrid54584.2022.00047

Scanflow-K8s: Agent-based Framework for Autonomic Management and Supervision of ML Workflows in Kubernetes Clusters

Abstract: Machine Learning (ML) projects are currently heavily based on workflows composed of some reproducible steps and executed as containerized pipelines to build or deploy ML models efficiently because of the flexibility, portability, and fast delivery they provide to the ML life-cycle. However, deployed models need to be watched and constantly managed, supervised, and debugged to guarantee their availability, validity, and robustness in unexpected situations. Therefore, containerized ML workflows would benefit fro…

Citations: cited by 7 publications (8 citation statements)
References: 13 publications (32 reference statements)
“…Our fine-grained scheduling approach for containerized HPC workloads is built over the existing Scanflow-Kubernetes platform [24] [25]. It is implemented both within a Scanflow(MPI) extension package in the application layer (see in Scanflow-Kubernetes github repository 1 ) and an enhanced Volcano scheduler/controller manager in the infrastructure layer (see in Volcano github repository 2 ).…”
Section: System Architecture (mentioning)
confidence: 99%
“…Given this landscape, any entity involved in the business of scaled computing will fall behind if these technological needs are not prioritized. 4 In cloud computing communities, machine learning workloads are also becoming increasingly important, [7][8][9][10] and the cloud container orchestration technology Kubernetes is becoming the de facto standard for orchestration of these workflows following its success orchestrating microservices. As of June of 2023, the Kubernetes project had over 74,000 contributors, making it the second largest open source project ever after Linux, and the "most widely used container orchestration platform in existence" (the CNCF project report).…”
Section: Introduction (mentioning)
confidence: 99%
“…ML applications focus on learning models from data and making predictions by using the trained model. From a runtime perspective, the ML training and batch ML inference could be executed as offline jobs and may take days to complete, whereas the online ML inference service is realized as a long-run service that is able to deal with dynamic prediction queries from end-users [97][98] [203][92]. These bring challenges in the platform layer for containers to consider and support the development, testing, deployment, and operation of wide range types of applications.…”
Section: Platform Opportunities (mentioning)
confidence: 99%
“…Our contributions to achieve this objective are as follows, and have resulted in publications [99][101] [97]. BD applications have also been containerized in this manner, as published formerly by our group [160], hence, this is not included as a contribution in this thesis:…”
Section: Objective 1: Enable Deployments Of HPC BD and ML Application... (mentioning)
confidence: 99%