Capturing Data Provenance from Statistical Software

Alter, George; Gager, Jack; Heus, Pascal; Hunter, Carson; Ionescu, Sanda; Iverson, Jeremy; Jagadish, H. V.; Lyle, Jared; Mueller, Alexander; Nordgaard, Sigve; Risnes, Ørnulf; Smith, Dan; Song, Jié

doi:10.2218/ijdc.v16i1.763

Cited by 3 publications

(2 citation statements)

References 7 publications

“…Provenance data is either extracted from software systems after [43] or during their operation [24,27,47]. There is active work towards recording provenance information without instrumenting the system or process [3,5,18,39].…”

Section: Related Work

mentioning

confidence: 99%

Towards Specificationless Monitoring of Provenance-Emitting Systems

Stoffers

Weinert

2022

Runtime Verification

Monitoring often requires insight into the monitored system as well as concrete specifications of expected behavior. More and more systems, however, provide information about their inner procedures by emitting provenance information in a W3C-standardized graph format. In this work, we present an approach to monitor such provenance data for anomalous behavior by performing spectral graph analysis on slices of the constructed provenance graph and by comparing the characteristics of each slice with those of a sliding window over recently seen slices. We argue that this approach not only simplifies the monitoring of heterogeneous distributed systems, but also enables applying a host of well-studied techniques to monitor such systems.

“…The general capability however fits into a broader context of other provenance or data pipeline research. This includes initiatives such as C2Metadata (Alter et al, 2021), which focus on a language independent representation of a data pipeline, and R packages such as targets (Landau, 2021) which focus on documenting pipeline code, and managing the execution of a pipeline, or RDataTracker which focusses on tracking the execution of a arbitrary R script (Lerner et al, 2018). dtrackr takes a more data oriented approach, which could be complementary, in which we remain agnostic to the detail of a data pipeline script or nature of its execution, but capture a subset of the transformations applied to data alongside the data itself, thereby documenting the data state as it is being manipulated.…”

mentioning

confidence: 99%

dtrackr: An R package for tracking the provenance of data

Challen

2022

JOSS

An accurate statement of the provenance of data is essential in biomedical research. Powerful data manipulation tools available in the tidyverse R package ecosystem (Wickham et al., 2019) provide the infrastructure to assemble, clean and filter data prior to statistical analysis. Manual documentation of the steps taken in the data pipeline and the provenance of data is a cumbersome and error prone task which may restrict reproducibility. dtrackr is a wrapper around a subset of the standard tidyverse data manipulation tools that allows automatic tracking of the processing steps applied to a data set, prior to statistical analysis. It allows early detection and reporting of data quality problems, and automatically documents a pipeline of data transformations as a flowchart in a format suitable for scientific publication, including, but not limited to CONSORT diagrams (Schulz et al., 2010).

Towards Specificationless Monitoring of Provenance-Emitting Systems

Stoffers

Weinert

2022

Preprint

Monitoring often requires insight into the monitored system as well as concrete specifications of expected behavior. More and more systems, however, provide information about their inner procedures by emitting provenance information in a W3C-standardized graph format. In this work, we present an approach to monitor such provenance data for anomalous behavior by performing spectral graph analysis on slices of the constructed provenance graph and by comparing the characteristics of each slice with those of a sliding window over recently seen slices. We argue that this approach not only simplifies the monitoring of heterogeneous distributed systems, but also enables applying a host of well-studied techniques to monitor such systems.
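The idea summarized in this abstract, computing a spectral characteristic per graph slice and comparing it with a sliding window of recent slices, can be illustrated with a minimal sketch. It is not the authors' implementation. The choice of characteristic (the largest eigenvalue of the graph Laplacian), the undirected treatment of provenance edges, the deviation rule, and every name used here (`SliceMonitor`, `window_size`, `threshold`, `min_spread`) are assumptions made for illustration only.

```python
import math
import random
from collections import deque
from statistics import mean, pstdev


def laplacian(nodes, edges):
    """Dense Laplacian L = D - A of a slice, treating edges as undirected (assumption)."""
    index = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    lap = [[0.0] * size for _ in range(size)]
    for u, v in edges:
        i, j = index[u], index[v]
        if i == j or lap[i][j] != 0.0:
            continue  # skip self-loops and duplicate edges
        lap[i][j] = lap[j][i] = -1.0
        lap[i][i] += 1.0
        lap[j][j] += 1.0
    return lap


def largest_eigenvalue(matrix, iterations=500):
    """Approximate the largest eigenvalue of a symmetric PSD matrix by power iteration."""
    n = len(matrix)
    if n == 0:
        return 0.0
    rng = random.Random(0)  # fixed seed so repeated runs give the same result
    vec = [rng.random() for _ in range(n)]
    for _ in range(iterations):
        nxt = [sum(row[k] * vec[k] for k in range(n)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return 0.0  # zero matrix, e.g. a slice without edges
        vec = [x / norm for x in nxt]
    product = [sum(row[k] * vec[k] for k in range(n)) for row in matrix]
    return sum(a * b for a, b in zip(vec, product))  # Rayleigh quotient


class SliceMonitor:
    """Flags a slice whose characteristic deviates from the recent window (hypothetical rule)."""

    def __init__(self, window_size=5, threshold=3.0, min_spread=0.1):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold
        self.min_spread = min_spread  # floor so identical past slices do not give zero tolerance

    def observe(self, nodes, edges):
        value = largest_eigenvalue(laplacian(nodes, edges))
        anomalous = False
        if len(self.window) == self.window.maxlen:  # only judge once the window is full
            spread = max(pstdev(self.window), self.min_spread)
            anomalous = abs(value - mean(self.window)) > self.threshold * spread
        self.window.append(value)
        return value, anomalous


# Usage: five similar chain-shaped slices, then one slice with a very different shape.
monitor = SliceMonitor(window_size=5)
chain = (["a", "b", "c"], [("a", "b"), ("b", "c")])
star = (["hub"] + [f"leaf{i}" for i in range(6)], [("hub", f"leaf{i}") for i in range(6)])
results = [monitor.observe(*chain)[1] for _ in range(5)]
results.append(monitor.observe(*star)[1])
```

In this sketch the first five slices only fill the window, so none of them is flagged. The star-shaped slice has a much larger Laplacian eigenvalue than the chains and is reported as anomalous. A real monitor would likely compare richer spectral features than a single eigenvalue.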
