2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) 2021
DOI: 10.1109/saner50967.2021.00046

On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects

citations
Cited by 22 publications
(10 citation statements)
references
References 23 publications
“…There are some tools available for supporting monitoring data quality tools such as Great Expectations [37], Databand [38], and Dataform [39]. Additionally, MLFlow [40] and Data Version Control (DVC) [41] also support data collection and management for model experiments, datasets, and parameters. However, these frameworks do not support a holistic explainability approach by keeping track of multi-aspect requirements discussed in Section III concerning end-to-end ML.…”
Section: Discussion On Tooling and Integration
Citation type: mentioning
confidence: 99%
“…While considering the stated demands on data quality during the various sections of the work on hand, no specific framework or quality management model for big data value chains, as introduced by the authors, is proposed. Barrak et al empirically analyzed 391 open-source projects which used Data Version Control (DVC) techniques with respect to coupling of software and DVC artifacts and their complexity evolution in [18]. Their empirical study concludes that using DVC versioning tools becomes a growing practice, even though there is a maintenance overhead.…”
Section: Related Work and State-of-the-art
Citation type: mentioning
confidence: 99%
“…As there is often a massive amount of frequently changing input data involved in ML projects, local data repositories (e.g., edge devices) often hold the actual data set. At the same time, the remote storage solutions only persist a hash of the current version [18].…”
Section: Data Versioning
Citation type: mentioning
confidence: 99%
“…There are generally four types of assets to manage in machine learning (ML) in order to achieve model reproducibility: resources (e.g., dataset and environment), software (e.g., source code), metadata (e.g., dependencies), and execution data (e.g., execution results) [43]. However, prior work [23,25,43] shows these assets should not be managed with the same toolsets (e.g., Git) used for source code [23]. Hence, new version management tools (e.g., DVC [12] and MLflow [1]) are specifically designed for managing ML assets.…”
Section: Introduction
Citation type: mentioning
confidence: 99%