2021
DOI: 10.1002/cpe.6544

Workflow provenance in the lifecycle of scientific machine learning

Abstract: Machine learning (ML) has already fundamentally changed several businesses. More recently, it has also been profoundly impacting the computational science and engineering domains, like geoscience, climate science, and health science. In these domains, users need to perform comprehensive data analyses combining scientific data and ML models to provide for critical requirements, such as reproducibility, model explainability, and experiment data understanding. However, scientific ML is multidisciplinary, heteroge…

Cited by 18 publications (8 citation statements)
References 54 publications
“…Figure 12 shows the results of the bandwidth for the three IO sizes: (a) 1 GB, (b) 10 GB, and (c) 80 GB. The x-axis shows the number of files [1, 10, 100, 1000] and the y-axis shows the measured bandwidth in MB/s. For each number of files we measure the read and write IO 5 times for both environments.…”
Section: IO Bandwidth (mentioning)
confidence: 99%
“…LifeSWS capitalizes on our previous experience in developing major systems for scientific applications such as: polystores with CloudMdSQL [17], workflows with OpenAlea [30], model management with Gypscie [35] [38], querying data across distributed services with DfAnalyzer [32] and Provlake [33], monitoring and debugging applications implemented in big data frameworks such as Apache Spark [12], and debugging workflows with BugDoc [18] and VersionClimber [29].…”
Section: The Centrality Of Workflows (mentioning)
confidence: 99%
“…It provides APIs to access ML development tools, such as PyTorch, Scikit-learn and Tensorflow. Also motivated by the objective of providing a holistic view to support the lifecycle of scientific ML, ProvLake [33] is a provenance data management system capable of capturing, integrating, and querying data across multiple distributed services, programs, databases, stores, and computational workflows by leveraging provenance data.…”
Section: Related Work (mentioning)
confidence: 99%
“…Recently, some researches targeted the distributed big data processing systems and proposed how to capture provenance (e.g., MapReduce [2,36], Spark [30] and Flink [41,42]). In addition, provenance has been also researched in the ML domain, and [39,48,54] focused on the provenance for the training phase of ML models.…”
Section: Related Work (mentioning)
confidence: 99%