2016
DOI: 10.14778/2994509.2994530

Explaining outputs in modern data analytics

Abstract: We report on the design and implementation of a general framework for interactively explaining the outputs of modern data-parallel computations, including iterative data analytics. To produce explanations, existing work adopts a naive backward tracing approach that runs into known issues: it may identify (i) too much information, which is difficult to process, and (ii) not enough information to reproduce the output, which hinders logical debugging of the program. The contribution of thi…
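The over-tracing problem the abstract describes can be illustrated with a toy sketch. This is not the paper's implementation; names like `run_with_tracing` and `explain` are hypothetical. It shows naive backward tracing returning every input that reached a reducer group, including records that do not affect the result:

```python
# Minimal sketch of naive backward tracing over a two-stage pipeline:
# each output key is mapped back to ALL inputs that reached its group,
# which may return far more records than needed to reproduce the output.

def run_with_tracing(records, map_fn, key_fn):
    """Map each record, group by key, and remember which inputs fed each group."""
    groups, lineage = {}, {}
    for r in records:
        k = key_fn(r)
        groups.setdefault(k, []).append(map_fn(r))
        lineage.setdefault(k, []).append(r)
    outputs = {k: sum(vs) for k, vs in groups.items()}
    return outputs, lineage

def explain(output_key, lineage):
    """Naive backward tracing: every input that contributed to the key."""
    return lineage[output_key]

records = [("a", 5), ("a", 0), ("a", 0), ("b", 7)]
outputs, lineage = run_with_tracing(records, lambda r: r[1], lambda r: r[0])
# outputs["a"] == 5, yet the naive explanation returns all three "a"
# records, including the zeros that do not influence the sum.
print(explain("a", lineage))  # [('a', 5), ('a', 0), ('a', 0)]
```

A minimal explanation would contain only `("a", 5)`; the naive trace cannot distinguish it from the irrelevant records.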

Cited by 30 publications (13 citation statements)
References 36 publications
“…With the goal of minimal provenance and output reproducibility in the context of differential dataflow, Chothia et al. [10] design custom rules for dataflow operators (i.e., map, reduce, join) to record record-level data deltas at each operator, for each iteration and each increment of dataflow execution. Their approach in part resembles FLOWDEBUG's StreamingOutlier influence function, which captures influence over incremental computation.…”
Section: Related Work
confidence: 99%
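The per-operator recording idea in the excerpt above can be sketched as follows. This is a hypothetical illustration, not Chothia et al.'s system; `DeltaLog` and `map_op` are invented names. Each operator logs record-level deltas (record, multiplicity change) tagged with the iteration that produced them:

```python
# Sketch: operators log record-level deltas per iteration, so an
# explanation can later replay only the deltas relevant to an output.
from collections import defaultdict

class DeltaLog:
    """Accumulates (iteration, record, multiplicity-change) per operator."""
    def __init__(self):
        self.entries = defaultdict(list)  # operator name -> list of deltas

    def record(self, op, iteration, rec, change):
        self.entries[op].append((iteration, rec, change))

def map_op(log, iteration, deltas, fn):
    """Apply fn to each incoming delta and log the resulting output delta."""
    out = []
    for rec, change in deltas:
        new_rec = fn(rec)
        log.record("map", iteration, new_rec, change)
        out.append((new_rec, change))
    return out

log = DeltaLog()
# Iteration 0 introduces two records (+1); iteration 1 retracts one (-1).
map_op(log, 0, [(3, +1), (4, +1)], lambda x: x * 2)
map_op(log, 1, [(4, -1)], lambda x: x * 2)
print(log.entries["map"])  # [(0, 6, 1), (0, 8, 1), (1, 8, -1)]
```

Because each delta carries its iteration, a replay for one output need only apply the deltas that reach it, rather than re-running the whole computation.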
“…While existing data provenance techniques modify the runtime of DISC frameworks, FLOWDEBUG does not require any modifications to the framework's runtime and instead provides an API on top of existing data structures such as Apache Spark RDDs, making it easier to adopt. Other data provenance approaches that leverage the notion of influence [10,40] or taint analysis [38] are limited in their generalizability, because they either rely on predefined, operator-specific data-partition strategies or require the costly practice of intercepting billions of system calls to process taint marks.…”
Section: Introduction
confidence: 99%
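The "API layer on top of existing data structures, no runtime changes" approach attributed to FLOWDEBUG above can be sketched with a thin wrapper over a plain collection. `TaggedCollection` is an illustrative name, not FLOWDEBUG's actual API; the point is that provenance travels with the data while the underlying engine stays unmodified:

```python
# Sketch: wrap a plain collection so each element carries its input
# origins; map/filter preserve the tags without touching the runtime.

class TaggedCollection:
    """A list of (value, origin-set) pairs with provenance-preserving ops."""
    def __init__(self, pairs):
        self.pairs = pairs

    @classmethod
    def from_values(cls, values):
        # Initially, each input element is its own origin (by index).
        return cls([(v, {i}) for i, v in enumerate(values)])

    def map(self, fn):
        return TaggedCollection([(fn(v), o) for v, o in self.pairs])

    def filter(self, pred):
        return TaggedCollection([(v, o) for v, o in self.pairs if pred(v)])

    def origins_of(self, pred):
        """Union the origin sets of all surviving elements matching pred."""
        return set().union(*[o for v, o in self.pairs if pred(v)])

c = (TaggedCollection.from_values([1, 2, 3, 4])
     .map(lambda x: x * 10)
     .filter(lambda x: x > 15))
print(c.origins_of(lambda v: v == 30))  # {2}: the index of input 3
```

In a real DISC setting the same pattern would wrap an RDD rather than a list, but the adoption argument is identical: no framework modification, only a library layered over the existing collection API.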
“…Selective refresh may not always update the same target outputs, for instance if the workflow contains a one-to-many operator followed by two non-monotonic aggregation operators [25]. The notion of unsafe selective refresh, and recent techniques to address it [10], highlight the value of leveraging the provenance literature to ensure correctness in interactive visualizations.…”
Section: Listing 3: Examples Of Tooltips and Details-on-Demand
confidence: 99%
“…Finding such a pair is extremely hard in DISC applications, because a user must synthesize two different input files that produce similar but not identical intermediate results in each stage. Chothia et al. [12] present a provenance system implemented over a differential dataflow system, such as Naiad [40]. Their approach focuses on providing semantically correct explanations of outputs through replay, by leveraging the properties of a differential dataflow system.…”
Section: Related Work
confidence: 99%