2019 IEEE 35th International Conference on Data Engineering (ICDE) 2019
DOI: 10.1109/icde.2019.00025

Fine-Grained Provenance for Matching & ETL

Abstract: Data provenance tools capture the steps used to produce analyses. However, scientists must choose among workflow provenance systems, which allow arbitrary code but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance, but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization, but support a limited subset of data science tasks. None of these solutions are wel…

Cited by 9 publications (7 citation statements)
References 25 publications
“…We found that the execution cost of function operator with an expensive UDF tends to be dominant in the overall processing time compared with other relational operators. To reduce the time of rerunning tasks and segments in Rerun and FM, we utilize the semi-join pushdown optimization used in [57]. When we get output tuples O of a task and Ō ⊆ O are specified as targets for the augmented lineage derivation, we apply the semi-join O ⋉ Ō and then push it down along the operator tree.…”
Section: Deployment Of Intermediate Results (mentioning)
confidence: 99%
“…When we get output tuples O of a task and Ō ⊆ O are specified as targets for the augmented lineage derivation, we apply the semi-join O ⋉ Ō and then push it down along the operator tree. Although [57] does not consider the difference operator, the pushdown transformation…”
Section: Deployment Of Intermediate Results (mentioning)
confidence: 99%
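
The semi-join pushdown described in the statements above can be illustrated with a small sketch. This is not the implementation from the cited paper or from [57]; it assumes a hypothetical two-operator task (a selection followed by an expensive UDF that preserves an `id` key), and every name in it (`run_task`, `expensive_udf`, `key_filter`) is invented for illustration. It shows only the general idea: restricting the rerun to tuples that can contribute to the targets Ō, so the UDF is invoked fewer times.

```python
from typing import Callable, Optional

Row = dict

def semi_join(left: list, right: list, key: str) -> list:
    """Keep the rows of `left` whose `key` value also appears in `right`."""
    keys = {r[key] for r in right}
    return [row for row in left if row[key] in keys]

def run_task(rows: list, udf: Callable[[int], int],
             key_filter: Optional[list] = None) -> list:
    """Hypothetical task: selection, then an expensive UDF that preserves `id`.

    If `key_filter` is given, the semi-join is applied BELOW the UDF,
    so the UDF only sees rows that can reach the target outputs.
    """
    selected = [r for r in rows if r["value"] >= 0]
    if key_filter is not None:
        selected = semi_join(selected, key_filter, "id")  # pushed-down semi-join
    return [{"id": r["id"], "score": udf(r["value"])} for r in selected]

calls = []  # records every UDF invocation so the cost can be compared

def expensive_udf(v: int) -> int:
    calls.append(v)
    return v * v

inputs = [{"id": i, "value": v} for i, v in enumerate([3, -1, 4, 1, 5])]

# First execution produces O; the user then picks targets (a subset of O).
outputs = run_task(inputs, expensive_udf)
targets = [o for o in outputs if o["score"] > 10]

# Naive rerun: recompute everything, then apply the semi-join on top.
calls.clear()
naive = semi_join(run_task(inputs, expensive_udf), targets, "id")
naive_calls = len(calls)

# Pushed-down rerun: apply the semi-join before the UDF.
calls.clear()
pushed = run_task(inputs, expensive_udf, key_filter=targets)
pushed_calls = len(calls)
```

Both reruns yield the same tuples, but the pushed-down variant calls the UDF only for the rows that feed the targets. The pushdown is valid here only because the UDF operator preserves the join key; operators that do not (or, as the statement notes, difference) need separate treatment.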
“…DISC systems natively support nested data formats such as JSON, XML, Parquet, or Protocol Buffers. Provenance capture for DISC systems has been studied in, e.g., [1,15,21,22,28,42]. Why-not explanations are practically relevant in these systems.…”
Section: Related Work (mentioning)
confidence: 99%
“…The main difference is that we have focused on a restricted set of core operators (with some of those in [32] missing and others combined in one) with the specific goal of providing a solid basis to an effective technique for capturing data provenance of classical preprocessing operators. We point out that our algebra can be easily extended to include operators implementing other ETL/ELT-like transformations, such as join, intersection, and union, whose fine-grained provenance capture have been described elsewhere [50].…”
Section: Data Manipulation Model (mentioning)
confidence: 99%