2022
DOI: 10.2218/ijdc.v16i1.763
Capturing Data Provenance from Statistical Software

Abstract: We have created tools that automate one of the most burdensome aspects of documenting the provenance of research data: describing data transformations performed by statistical software. Researchers in many fields use statistical software (SPSS, Stata, SAS, R, Python) for data transformation and data management as well as analysis. The C2Metadata ("Continuous Capture of Metadata for Statistical Data") Project creates a metadata workflow paralleling the data management process by deriving provenance informatio…

Cited by 3 publications (2 citation statements)
References 7 publications

“…Provenance data is either extracted from software systems after [43] or during their operation [24,27,47]. There is active work towards recording provenance information without instrumenting the system or process [3,5,18,39].…”
Section: Related Work
mentioning
confidence: 99%
“…The general capability, however, fits into a broader context of other provenance or data pipeline research. This includes initiatives such as C2Metadata (Alter et al., 2021), which focus on a language-independent representation of a data pipeline, and R packages such as targets (Landau, 2021), which focus on documenting pipeline code and managing the execution of a pipeline, or RDataTracker, which focusses on tracking the execution of an arbitrary R script (Lerner et al., 2018). dtrackr takes a more data-oriented approach, which could be complementary, in which we remain agnostic to the detail of a data pipeline script or nature of its execution, but capture a subset of the transformations applied to data alongside the data itself, thereby documenting the data state as it is being manipulated.…”
mentioning
confidence: 99%