Automated debugging in data-intensive scalable computing

Gulzar, Muhammad Ali; Interlandi, Matteo; Han, Xueyuan; Li, Mingda; Condie, Tyson; Kim, Miryung

doi:10.1145/3127479.3131624

Cited by 21 publications (14 citation statements)

References 44 publications (24 reference statements)


“…This section discusses two examples of Apache Spark applications, inspired by the motivating example presented elsewhere [18], to show the benefit of FLOWDEBUG. FLOWDEBUG targets commonly used big data analytics running on top of Apache Spark, but its key idea generalizes to any big data analytics running on data intensive scalable computing (DISC) frameworks.…”

Section: Motivating Example (mentioning)

confidence: 99%

“…An alternative approach would be to isolate a subset of input records contributing to each suspicious output by using search-based debugging [18] or data provenance [25], both of which have limitations related to inefficiency and imprecision, discussed below. Imprecision of Data Provenance.…”

Section: Running Example (mentioning)

confidence: 99%

“…In other words, narrowing down the scope of responsible inputs requires repetitive re-execution of the program with different inputs. For example, BigSift [18] would incur 41 runs for Figure 1a, since its black-box debugging procedure does not recognize that the given UDF at line 26 selects uses only two values (min and max) for each key group. Debugging Example 1 with FLOWDEBUG.…”

Section: Running Example (mentioning)

confidence: 99%

“…Alternatively, search-based debugging techniques [18,43] can be used for post-mortem analysis as they repetitively run the program with different input subsets and check whether a test failure appears. Thus, these black-box techniques require multiple re-runs with different input subsets, which can take several hours, if not days.…”

Section: Introduction (mentioning)

confidence: 99%
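The re-run loop this statement describes can be sketched as follows. This is a simplified delta-debugging-style search written for illustration, not the BigSift implementation; `minimize_failing_input` and the `fails` predicate are hypothetical names, and each call to `fails` stands in for one full re-execution of the program on a candidate input subset.

```python
from typing import Callable, List, Sequence, Tuple


def minimize_failing_input(
    records: Sequence[int],
    fails: Callable[[Sequence[int]], bool],
) -> Tuple[List[int], int]:
    """Shrink `records` to a smaller subset on which `fails` is still True.

    Returns the reduced subset and the number of calls to `fails`,
    i.e. the number of program re-runs the search needed.
    """
    current = list(records)
    runs = 0
    granularity = 2  # number of chunks the input is split into
    while len(current) >= 2:
        chunk = max(1, len(current) // granularity)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            # Everything except chunk i, in the original order.
            complement = [r for j, s in enumerate(subsets) if j != i for r in s]
            for candidate in (subset, complement):
                if not candidate or len(candidate) == len(current):
                    continue
                runs += 1  # one re-execution of the program
                if fails(candidate):
                    # Restart coarse after keeping a chunk; stay fine-grained
                    # after dropping one.
                    granularity = 2 if candidate is subset else max(granularity - 1, 2)
                    current = candidate
                    reduced = True
                    break
            if reduced:
                break
        if not reduced:
            if granularity >= len(current):
                break  # no single record can be dropped any more
            granularity = min(len(current), granularity * 2)
    return current, runs


# Hypothetical failure: the output is wrong whenever record 42 is in the input.
culprits, reruns = minimize_failing_input(range(100), lambda s: 42 in s)
```

Because the predicate is a black box, the only way the search learns anything is by paying for another run, which is the cost the citing paper attributes to this family of techniques.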

“…Compared to Titian [25], FLOWDEBUG improves precision by up to 99.9 percentage points. Compared to BigSift [18], FLOWDEBUG is able to improve recall by up to 99.3 percentage points. Finally, FLOWDEBUG is able to perform debugging up to 51X faster than Titian and 1000X faster than BigSift while adding an instrumentation overhead of 0.4X-6.1X compared to Apache Spark.…”

Section: Introduction (mentioning)

confidence: 99%


Influence-based provenance for dataflow applications with taint propagation

Teoh, Gulzar, Kim (2020)

Proceedings of the 11th ACM Symposium on Cloud Computing

Self Cite


Debugging big data analytics often requires a root cause analysis to pinpoint the precise culprit records in an input dataset responsible for incorrect or anomalous output. Existing debugging or data provenance approaches do not track fine-grained control and data flows in user-defined application code; thus, the returned culprit data is often too large for manual inspection and expensive post-mortem analysis is required. We design FLOWDEBUG to identify a highly precise set of input records based on two key insights. First, FLOWDEBUG precisely tracks control and data flow within user-defined functions to propagate taints at a fine-grained level by inserting custom data abstractions through automated source to source transformation. Second, it introduces a novel notion of influence-based provenance for many-to-one dependencies to prioritize which input records are more responsible than others by analyzing the semantics of a user-defined function used for aggregation. By design, our approach does not require any modification to the framework's runtime and can be applied to existing applications easily. FLOWDEBUG significantly improves the precision of debugging results by up to 99.9 percentage points and avoids repetitive reruns required for post-mortem analysis by a factor of 33 while incurring an instrumentation overhead of 0.4X-6.1X on vanilla Spark.

CCS Concepts: Information systems → MapReduce-based systems; Theory of computation → Data provenance; Software and its engineering → Software testing and debugging.
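The difference between ordinary provenance and the influence-based notion in this abstract can be illustrated with a small sketch. It assumes a range aggregation (max minus min), in the spirit of the min/max UDF mentioned in the citation statements above; the `Tainted` class and both functions are hypothetical names for illustration and are not FLOWDEBUG's API or implementation.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Tainted:
    """A value paired with the ids of the input records it depends on."""
    value: float
    sources: FrozenSet[int]


def range_with_full_provenance(group: List[Tainted]) -> Tainted:
    """Many-to-one provenance: the output is linked to every record in the group."""
    values = [t.value for t in group]
    every_source = frozenset().union(*(t.sources for t in group))
    return Tainted(max(values) - min(values), every_source)


def range_with_influence(group: List[Tainted]) -> Tainted:
    """Influence-style provenance: keep only the records that determined the result."""
    lo = min(group, key=lambda t: t.value)
    hi = max(group, key=lambda t: t.value)
    return Tainted(hi.value - lo.value, lo.sources | hi.sources)


# Hypothetical key group: record 3 holds a suspicious reading.
readings = [10.0, 12.0, 11.0, 95.0, 13.0]
group = [Tainted(v, frozenset({i})) for i, v in enumerate(readings)]

coarse = range_with_full_provenance(group)  # blames all five records
precise = range_with_influence(group)       # blames only the min and max records
```

Both functions compute the same output value; they differ only in how many input records are reported as responsible for it, which is what determines how much data a developer has to inspect.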
