DIFF: a relational interface for large-scale data explanation

Abuzaid, Firas; Kraft, Peter; Suri, Sahaana; Gan, Edward; Xu, Eric; Shenoy, Atul; Ananthanarayan, Asvin; Sheu, John; Meijer, Erik; Wu, Xi; Naughton, Jeff; Bailis, Peter; Zaharia, Matei

doi:10.1007/s00778-020-00633-6

Cited by 23 publications

(54 citation statements)

References 43 publications


“…For example, expressing the query in Listing 2 using existing SQL clauses (see Figure 3) is much more verbose, requiring a complex sub-query for each (grouping, measure). Prior work have proposed similar succinct abstractions such as GROUPING SETs [17] and CUBE [22] (both widely adopted by most of the databases) and more recently DIFF [6], which share our overall goal that with an extended syntax, complex analytic queries are easier to write and optimize.…”

Section: Syntax and Semantics (mentioning)

confidence: 99%

“…In contrast, we provide extensions to traditional query optimization and execution layers of relational databases to support comparative queries like other SQL queries. Similar to our approach, there have been database extensions [38,23,26,33], the most recent being the DIFF operator [6], that support association and frequent pattern mining. While our focus is on aggregate distance measures such as Lp norms (our focus), we share their goal that with an extended syntax, complex analytic queries are easier to write and optimize.…”

Section: Related Work (mentioning)

confidence: 99%

“…Note that the function DIFF is distinct from another operator [6] with similar name 2. We ignore the pth root as it does not affect the ranking of subsets.…”

mentioning

confidence: 99%
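The remark about the pth root in the statement above can be checked with a short derivation. This is only a sketch: the symbols a, b, c (vectors of aggregate values for the subsets being compared), d_p, and s_p are assumed notation, not taken from the citing paper.

```latex
% Assumed notation: a_i, b_i, c_i are aggregate values of the compared subsets.
\[
d_p(a,b) = \Bigl(\sum_{i} |a_i - b_i|^p\Bigr)^{1/p},
\qquad
s_p(a,b) = \sum_{i} |a_i - b_i|^p = d_p(a,b)^p .
\]
% For p > 0, the map t -> t^{1/p} is strictly increasing on t >= 0, so
\[
s_p(a,b) \le s_p(a,c) \iff d_p(a,b) \le d_p(a,c).
\]
```

Ranking subsets by s_p therefore gives the same order as ranking by the full Lp distance d_p.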


COMPARE: Accelerating Groupwise Comparison in Relational Databases for Data Analytics

Siddiqui, Chaudhuri, Narasayya

2021

Preprint


Data analysis often involves comparing subsets of data across many dimensions to find unusual trends and patterns. While comparisons between subsets of data can be expressed in SQL, such queries tend to be complex to write and perform poorly over large, high-dimensional datasets. In this paper, we propose a new logical operator, COMPARE, for relational databases that concisely captures the enumeration and comparison of subsets of data and greatly simplifies the expression of a large class of comparative queries. We extend the database engine with optimization techniques that exploit the semantics of COMPARE to significantly improve the performance of such queries. We have implemented these extensions inside Microsoft SQL Server, a commercial DBMS engine. Our extensive evaluation on synthetic and real-world datasets shows that COMPARE results in a significant speedup over existing approaches, including physical plans generated by today's database systems, user-defined functions (UDFs), and middleware solutions that compare subsets outside the database.


“…Generating an explanation (i.e., a predicate) from inference data can certainly help to understand the answer to an inference query. However, an ML pipeline does not only contain the inference data but also the training and source data.…”

Section: SQL Explain (mentioning)

confidence: 99%

“…Unfortunately, existing SQL explanation approaches [1,33,34,44] are ill-equipped to address this setting (Table 1) because they are based on analysis of the query provenance. Although they can generate a predicate explanation over the inference data, the provenance analysis does not extend across model training nor UDFs, which are prevalent in data science workflows.…”

Section: SQL Explain (mentioning)

confidence: 99%

Explaining Inference Queries with Bayesian Optimization

Lockhart, Peng, Wu, et al.

2021

Preprint


Obtaining an explanation for an SQL query result can enrich the analysis experience, reveal data errors, and provide deeper insight into the data. Inference query explanation seeks to explain unexpected aggregate query results on inference data; such queries are challenging to explain because an explanation may need to be derived from the source, training, or inference data in an ML pipeline. In this paper, we model an objective function as a black-box function and propose BOExplain, a novel framework for explaining inference queries using Bayesian optimization (BO). An explanation is a predicate defining the input tuples that should be removed so that the query result of interest is significantly affected. BO, a technique for finding the global optimum of a black-box function, is used to find the best predicate. We develop two new techniques (individual contribution encoding and warm start) to handle categorical variables. We perform experiments showing that the predicates found by BOExplain have a higher degree of explanation compared to those found by the state-of-the-art query explanation engines. We also show that BOExplain is effective at deriving explanations for inference queries from source and training data on three real-world datasets.
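The abstract's notion of an explanation, a predicate whose matching tuples are removed to see how much the query result changes, can be illustrated with a toy sketch. Everything here is assumed for illustration: the rows, the column names, the `explanation_effect` helper, and the candidate predicates are hypothetical. The sketch scores every candidate exhaustively; it does not implement Bayesian optimization or BOExplain itself.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def average(rows: List[Row], column: str) -> float:
    """AVG(column) over rows; 0.0 for an empty input."""
    values = [float(r[column]) for r in rows]
    return sum(values) / len(values) if values else 0.0


def explanation_effect(rows: List[Row],
                       predicate: Callable[[Row], bool],
                       column: str) -> float:
    """Change in AVG(column) caused by removing tuples that match predicate."""
    kept = [r for r in rows if not predicate(r)]
    return average(rows, column) - average(kept, column)


# Hypothetical inference data: one row per tuple with a model score.
rows: List[Row] = [
    {"region": "east", "score": 0.9},
    {"region": "east", "score": 0.8},
    {"region": "west", "score": 0.1},
    {"region": "west", "score": 0.2},
    {"region": "west", "score": 0.3},
]

# Hypothetical candidate predicates (the search space a real system explores).
candidates: Dict[str, Callable[[Row], bool]] = {
    "region = 'east'": lambda r: r["region"] == "east",
    "region = 'west'": lambda r: r["region"] == "west",
}

# Score each candidate and keep the one that shifts the aggregate the most.
effects = {name: explanation_effect(rows, pred, "score")
           for name, pred in candidates.items()}
best = max(effects, key=lambda name: abs(effects[name]))
print(best, effects[best])
```

A real search space of predicates is far too large to enumerate this way, which is the gap that the black-box optimization described in the abstract is meant to address.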
