Divide & conquer-based inclusion dependency discovery

Papenbrock, Thorsten; Kruse, Sebastian; Quiané-Ruiz, Jorge-Arnulfo; Naumann, Felix

doi:10.14778/2752939.2752946

Cited by 62 publications

(23 citation statements)

References 16 publications


“…In order to compute the above optimization problem, we try to extract the roots of the first derivative function of Equation 33 (i.e., f(r, α₁, α₂, b)) with respect to r. However, the derivative function is a polynomial function with degree of r larger than four. According to Abel's impossibility theorem [39], there is no algebraic solution, thus we try to give the numerical solution.…”

Section: Then the Variance Of Gb-kmv Methods By Equation 32 Is (mentioning)

confidence: 99%


GB-KMV: An Augmented KMV Sketch for Approximate Containment Similarity Search

Yang, Zhang, et al. 2019

2019 IEEE 35th International Conference on Data Engineering (ICDE)


In this paper, we study the problem of approximate containment similarity search. Given two records Q and X, the containment similarity between Q and X with respect to Q is |Q ∩ X| / |Q|. Given a query record Q and a set of records S, the containment similarity search finds a set of records from S whose containment similarity regarding Q is not less than the given threshold. This problem has many important applications in commercial and scientific fields such as record matching and domain search. Existing solution relies on the asymmetric LSH method by transforming the containment similarity to well-studied Jaccard similarity. In this paper, we use an inherently different framework by transforming the containment similarity to set intersection. We propose a novel augmented KMV sketch technique, namely GB-KMV, which is data-dependent and can achieve a much better trade-off between the sketch size and the accuracy. We provide a set of theoretical analysis to underpin the proposed augmented KMV sketch technique, and show that it outperforms the state-of-the-art technique LSH-E in terms of estimation accuracy under practical assumption. Our comprehensive experiments on real-life datasets verify that GB-KMV is superior to LSH-E in terms of the space-accuracy trade-off, time-accuracy trade-off, and the sketch construction time. For instance, with similar estimation accuracy (F-1 score), GB-KMV is over 100 times faster than LSH-E on several real-life datasets.
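The containment similarity defined in this abstract can be illustrated with a small exact computation. The sketch below is not GB-KMV and involves no sketching or LSH; it only evaluates |Q ∩ X| / |Q| directly and filters by a threshold. The function names `containment_similarity` and `containment_search` are hypothetical and chosen for illustration.

```python
def containment_similarity(q, x):
    """Exact containment similarity of Q and X with respect to Q: |Q ∩ X| / |Q|."""
    q_set, x_set = set(q), set(x)
    if not q_set:
        raise ValueError("query record Q must be non-empty")
    return len(q_set & x_set) / len(q_set)


def containment_search(q, records, threshold):
    """Return the records whose containment similarity regarding Q is >= threshold."""
    return [x for x in records if containment_similarity(q, x) >= threshold]


if __name__ == "__main__":
    query = {"a", "b", "c", "d"}
    records = [{"a", "b", "c"}, {"a", "x"}, {"a", "b", "c", "d", "e"}]
    for record in containment_search(query, records, threshold=0.75):
        print(sorted(record))
```

Note that the measure is asymmetric: a large record X that fully covers Q scores 1.0 regardless of how many extra elements X holds, which is what distinguishes it from Jaccard similarity.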


“…In a dataset, the discovery of all inclusion dependencies is a crucial part of data profiling efforts. It has many applications such as foreign-key detection and data integration (e.g., [22], [31], [8], [33], [30]).…”

Section: Introduction (mentioning)

confidence: 99%

GB-KMV: An Augmented KMV Sketch for Approximate Containment Similarity Search

Yang, Zhang, et al. 2019

2019 IEEE 35th International Conference on Data Engineering (ICDE)


“…Additionally, hybrid algorithms have been proposed in [87,102] that combine bottom-up and top-down traversal for additional pruning. The Binder algorithm uses divide and conquer principles to handle larger datasets than related work [114]. In the divide step, it splits the input dataset horizontally into partitions and vertically into buckets with the goal to fit each partition into main memory.…”

Section: Generating N-ary Inclusion Dependencies (mentioning)

confidence: 99%

Profiling relational data: a survey

Abedjan et al. 2015

Self Cite


Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
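The divide-and-conquer idea that the citation statement above attributes to Binder can be sketched for unary inclusion dependencies. This is a simplified illustration under stated assumptions, not the published algorithm: everything stays in memory, nothing is spilled to disk, and validation is a plain subset check per partition. The name `discover_unary_inds` and the parameter `num_partitions` are hypothetical.

```python
def discover_unary_inds(columns, num_partitions=4):
    """Find all pairs (dep, ref) of distinct columns with values(dep) ⊆ values(ref).

    columns: dict mapping a column name to an iterable of hashable values.
    None is treated as NULL and ignored.
    """
    # Divide: hash-partition the values (horizontal split); inside each
    # partition keep one bucket of distinct values per column (vertical split).
    partitions = [dict() for _ in range(num_partitions)]
    for name, values in columns.items():
        for value in values:
            if value is None:
                continue
            part = partitions[hash(value) % num_partitions]
            part.setdefault(name, set()).add(value)

    # Start with every ordered pair of distinct columns as a candidate.
    names = list(columns)
    candidates = {(dep, ref) for dep in names for ref in names if dep != ref}

    # Conquer: a candidate survives only if it holds in every partition.
    # Equal values always hash to the same partition, so checking partitions
    # independently is sufficient.
    for part in partitions:
        for dep, ref in list(candidates):
            if not part.get(dep, set()) <= part.get(ref, set()):
                candidates.discard((dep, ref))
    return candidates


if __name__ == "__main__":
    tables = {
        "orders.customer_id": [1, 2, 2, 3],
        "customers.id": [1, 2, 3, 4],
        "customers.name": ["ann", "bob"],
    }
    print(discover_unary_inds(tables))
```

Because each partition can be validated on its own, only one partition needs to be resident at a time, which is the property that lets a partitioned approach handle inputs larger than main memory.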


“…If they are not available in the schema, one can extract them from the database content. AutoMode uses the Binder algorithm [9] to discover INDs from the database, shown by the Exact IND discovery box in Figure 1, and generates all unary INDs implied by them.…”

Section: Generating Predicate Definitions (mentioning)

confidence: 99%

AutoMode: Relational Learning with Less Black Magic

Picado, Pathak, Termehchy, et al. 2018

2018 IEEE 34th International Conference on Data Engineering (ICDE)


Relational databases are valuable resources for learning novel and interesting relations and concepts. Relational learning algorithms learn the Datalog definition of new relations in terms of the existing relations in the database. In order to constrain the search through the large space of candidate definitions, users must tune the algorithm by specifying a language bias. Unfortunately, specifying the language bias is done via trial and error and is guided by the expert's intuitions. Hence, it normally takes a great deal of time and effort to effectively use these algorithms. In particular, it is hard to find a user that knows computer science concepts, such as database schema, and has a reasonable intuition about the target relation in special domains, such as biology. We propose AutoMode, a system that leverages information in the schema and content of the database to automatically induce the language bias used by popular relational learning systems. We show that AutoMode delivers the same accuracy as using manually-written language bias by imposing only a slight overhead on the running time of the learning algorithm.
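The citation statement for this paper mentions generating the unary INDs implied by discovered INDs. A general fact about inclusion dependencies is that an n-ary IND R[A1, ..., An] ⊆ S[B1, ..., Bn] implies each positional unary IND R[Ai] ⊆ S[Bi]. The sketch below shows only that projection step; the representation of an IND as two attribute lists and the name `implied_unary_inds` are assumptions for illustration, not AutoMode's actual data structures.

```python
def implied_unary_inds(dependent, referenced):
    """Project an n-ary IND dependent ⊆ referenced onto its unary INDs.

    dependent, referenced: equally long sequences of attribute names,
    paired by position.
    """
    if len(dependent) != len(referenced):
        raise ValueError("both sides of an IND must have the same arity")
    return list(zip(dependent, referenced))


if __name__ == "__main__":
    for dep, ref in implied_unary_inds(["R.a", "R.b"], ["S.c", "S.d"]):
        print(f"{dep} ⊆ {ref}")
```

The converse does not hold in general: a set of valid unary INDs does not guarantee that the corresponding n-ary IND is valid.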
