2017
DOI: 10.1007/s00778-017-0474-5

Adding data provenance support to Apache Spark

Abstract: Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. Today’s DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial-and-error debugging. To aid this effort, we built Titian, a library that enables data provenance (tracking data through transformations) in Apache Spark. Data scientists using the Titian Spark extension wil…

Citations: cited by 58 publications (87 citation statements)
References: 30 publications (49 reference statements)
“…This framework integrates the different schema matching and ontology alignment techniques for the purpose of information profiling. Metadata annotation can be efficient and does not heavily affect processing times of datasets in the DL as shown in related experiments like [13], [14] and in our experiments in Section VI.…”
Section: A Framework For Content Metadata Management (mentioning)
confidence: 81%
“…Currently, data profiling and annotation is of great importance for research in DL architectures and is currently a hot topic for research [3], [12], [13]. Some techniques and approaches were previously investigated, but are mainly focused on relational content metadata [7], [10], free-text metadata [13], or data provenance metadata [1], [14]. Most of the current research efforts are suggesting the need for a governed metadata management process for integrating different varieties of BD [8], [13], [15].…”
Section: Related Work (mentioning)
confidence: 99%
“…A major drawback of RAMP and Newt is that they do not provide access to the intermediate data of the computation (in contrast to [1] that offers this functionality); consequently, these two systems cannot provide the How provenance of an output record. Based on this limitation, Titian [27] made some nice progress in extending Spark [45] with step-by-step provenance tracking. Titian materializes the dependencies between individual records in a Spark job (including the intermediate ones), and offers an API for interactive forward and backward tracing of dependencies.…”
Section: Related Work (mentioning)
confidence: 99%
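
The record-level tracing described in the statement above can be illustrated with a minimal sketch. It is plain Python rather than Spark, and it is not Titian's API: the `LineageGraph` class, its `link`, `backward`, and `forward` methods, and the record identifiers are all hypothetical names chosen for illustration. The sketch assumes only that each derived record remembers which records it came from, intermediate ones included, so that dependencies can be followed in either direction.

```python
from collections import defaultdict


class LineageGraph:
    """Toy record-level provenance store (hypothetical; not Titian's API)."""

    def __init__(self):
        self.parents = defaultdict(set)   # record id -> records it was derived from
        self.children = defaultdict(set)  # record id -> records derived from it

    def link(self, src, dst):
        """Record that `dst` was derived from `src`."""
        self.parents[dst].add(src)
        self.children[src].add(dst)

    @staticmethod
    def _reach(start, edges):
        """Return every record reachable from `start` by following `edges`."""
        seen, stack = set(), [start]
        while stack:
            for nxt in edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def backward(self, record):
        """All upstream records (intermediate and input) behind `record`."""
        return self._reach(record, self.parents)

    def forward(self, record):
        """All downstream records (intermediate and output) affected by `record`."""
        return self._reach(record, self.children)


# A word-count-style job with made-up ids: two input lines are split into
# intermediate word records, which are then grouped into per-word counts.
g = LineageGraph()
for src, dst in [("line0", "w0"), ("line0", "w1"), ("line1", "w2"), ("line1", "w3"),
                 ("w0", "count_a"), ("w1", "count_b"), ("w2", "count_b"), ("w3", "count_c")]:
    g.link(src, dst)

print(sorted(g.backward("count_b")))  # inputs and intermediates behind one output
print(sorted(g.forward("line0")))     # everything one input line contributed to
```

Keeping both edge directions makes either trace a simple graph walk, at the cost of storing each dependency twice. How a real system stores and queries these dependencies at scale is outside the scope of this sketch.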
“…Datalog-based Native Operator Level According to DTaP* [46] (NDlog Engine on ns-3) (NDlog) Theorem 1
1 The current version of Titian does not support iteration through GraphX [44].
2 Newt has been applied to Hyracks and Hadoop [31], and to Spark [27], all of which support DAG dataflows.
* These systems are not general-purpose data processing systems but they offer interesting features regarding provenance management.…”
Section: Related Work (mentioning)
confidence: 99%