An Efficient Algorithm to Compute the Candidate Keys of a Relational Database Schema

Saiedian, Hossein; Spencer, Thomas H.

doi:10.1093/comjnl/39.2.124

Cited by 24 publications

(19 citation statements)

References 4 publications

(7 reference statements)

“…Moreover, the discovery of FDs [15,16,19] is also very similar to the problem of discovering uniques, as uniques functionally determine all other individual columns within a table. Thus, some approaches for unique discovery incorporate the knowledge on existing FDs [1,24]. Saiedian and Spencer presented an FD-based technique that supports unique discovery by identifying columns that are definitely part of all uniques and columns that are never part of any unique [24].…”

Section: Related Work (mentioning)

confidence: 99%

“…Thus, some approaches for unique discovery incorporate the knowledge on existing FDs [1,24]. Saiedian and Spencer presented an FD-based technique that supports unique discovery by identifying columns that are definitely part of all uniques and columns that are never part of any unique [24]. They showed that given a minimal set of FDs, any column that appears only on the left side of a FD must be part of all keys.…”

Section: Related Work (mentioning)

confidence: 99%

Scalable discovery of unique column combinations

et al. 2013

The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires verifying an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself. In this paper, we devise Ducc, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice in a depth-first and random walk combination. This strategy allows Ducc to typically depend on the solution set size and hence to prune large swaths of the lattice. Ducc also incorporates row-based pruning to run uniqueness checks in just a few milliseconds. To achieve even higher scalability, Ducc runs on several CPU cores (scale-up) and compute nodes (scale-out) with a very low overhead. We exhaustively evaluate Ducc using three datasets (two real and one synthetic) with several million rows and hundreds of attributes. We compare Ducc with related work: Gordian and HCA. The results show that Ducc is up to more than 2 orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows the efficiency of Ducc to scale up and out.

“…Discovered uniques are good candidates for primary keys of a table. Therefore some literature refers to them as "candidate keys" [8].…”

Section: Unique Column Combinations (mentioning)

confidence: 99%

“…Saiedian and Spencer present an FD-based approach that supports unique discovery [8]. They showed that, given a minimal set of FDs, any column that appears only on the left side of the given FDs must be part of all keys, and columns that appear only on the right side of the FDs cannot be part of any key.…”

Section: Related Work (mentioning)

confidence: 99%
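
The left-side/right-side rule quoted above can be illustrated with a small sketch. This is not the algorithm from the cited article; it is a minimal Python example under stated assumptions: FDs are given as pairs of attribute sets, all FDs are non-trivial, and the names `classify_attributes` and `closure` are hypothetical helpers introduced here. Attributes that appear in no FD at all are also placed in every key, since nothing determines them.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# An FD "left -> right" as a pair of attribute sets (an assumed representation).
FD = Tuple[FrozenSet[str], FrozenSet[str]]


def classify_attributes(
    attributes: Iterable[str], fds: List[FD]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split attributes into (in every key, in no key, undecided)."""
    attrs = set(attributes)
    lhs: Set[str] = set()
    rhs: Set[str] = set()
    for left, right in fds:
        lhs |= left
        rhs |= right
    in_every_key = attrs - rhs   # never on a right side: nothing determines them
    in_no_key = rhs - lhs        # only on right sides: never needed in a key
    undecided = lhs & rhs        # on both sides: needs further search
    return in_every_key, in_no_key, undecided


def closure(seed: Iterable[str], fds: List[FD]) -> Set[str]:
    """Attribute closure of `seed` under `fds` (simple fixpoint iteration)."""
    result = set(seed)
    changed = True
    while changed:
        changed = False
        for left, right in fds:
            if left <= result and not right <= result:
                result |= right
                changed = True
    return result


# Hypothetical schema R(A, B, C, D) with A -> B and B -> C.
schema = {"A", "B", "C", "D"}
fds = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
core, excluded, undecided = classify_attributes(schema, fds)
# If the closure of the "in every key" set is the whole schema,
# that set is the only candidate key.
core_is_only_key = closure(core, fds) == schema
```

In this example the attributes A and D must be in every key, C can be in none, and B is undecided; because the closure of {A, D} already covers the schema, {A, D} is the single candidate key and no search over B is needed.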

Advancing the discovery of unique column combinations

Abedjan, Naumann 2011

Proceedings of the 20th ACM International Conference on Information and Knowledge Management

Unique column combinations of a relational database table are sets of columns that contain only unique values. Discovering such combinations is a fundamental research problem and has many different data management and knowledge discovery applications. Existing discovery algorithms are either brute force or have a high memory load and can thus be applied only to small datasets or samples. In this paper, the well-known Gordian algorithm [9] and "Apriori-based" algorithms [4] are compared and analyzed for further optimization. We greatly improve the Apriori algorithms through efficient candidate generation and statistics-based pruning methods. A hybrid solution HCA-Gordian combines the advantages of Gordian and our new algorithm HCA, and it outperforms all previous work in many situations.

Unique discovery has high significance in several data management applications, such as data modeling, anomaly detection, query optimization, and indexing. Discovered uniques are good candidates for primary keys of a table. Therefore some literature refers to them as "candidate keys" [8]. The term "composite key" is also used to highlight the fact that they comprise multiple columns [9]. We want to stress that the detection of uniques is a problem that can be solved exactly, while the detection of keys can only be solved heuristically. Uniqueness is a necessary precondition for a key, but only a human expert can "promote" a unique to a key, because uniques can appear by coincidence for a certain state of the data. In contrast, keys are consciously specified and denote a schema constraint.

An important property of uniques and keys is their minimality. Minimal uniques are uniques of which no strict subsets hold the uniqueness property. In principle, to identify a column combination K of fixed size as a unique, all tuples t_i must be scanned. A scan has a runtime of O(n) in the number n of rows. To detect duplicate values, one needs either a sort in O(n log n) or a hashing algorithm that needs O(n) space. Non-uniques are defined as follows: Definition 3. A column combination K that is not a unique is called a non-unique.

Discovering all uniques of a table or relational instance can be reduced to the problem of discovering all minimal uniques. Every superset of a minimal unique is also a unique. Hence, in the rest of this paper the discovery of all uniques is synonymously used for discovering all minimal uniques. The exponential complexity is caused by the fact that for a relational schema R = {C1, . . . , Cm}, there are 2^m − 1 subsets K ⊆ R...
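
The hash-based uniqueness check and the role of minimality described in this abstract can be sketched as follows. This is a naive level-wise enumeration written for illustration only; it is not Gordian, HCA, or Ducc, and it still visits up to 2^m − 1 column combinations in the worst case. The function names `is_unique` and `minimal_uniques`, and the representation of rows as tuples indexed by column position, are assumptions made for this example.

```python
from itertools import combinations
from typing import FrozenSet, Hashable, List, Sequence, Tuple

Row = Tuple[Hashable, ...]


def is_unique(rows: Sequence[Row], columns: Sequence[int]) -> bool:
    """Hash-based check: one O(n) scan using O(n) extra space."""
    seen = set()
    for row in rows:
        projected = tuple(row[c] for c in columns)
        if projected in seen:
            return False  # duplicate projection, so this is a non-unique
        seen.add(projected)
    return True


def minimal_uniques(rows: Sequence[Row], num_columns: int) -> List[FrozenSet[int]]:
    """Enumerate column combinations by increasing size, keeping minimal uniques."""
    found: List[FrozenSet[int]] = []
    for size in range(1, num_columns + 1):
        for combo in combinations(range(num_columns), size):
            candidate = frozenset(combo)
            # Every superset of a unique is also a unique, so it cannot be
            # minimal; skip it without scanning the data.
            if any(unique <= candidate for unique in found):
                continue
            if is_unique(rows, combo):
                found.append(candidate)
    return found


# Hypothetical three-column table.
table = [
    (1, "a", "x"),
    (1, "b", "x"),
    (2, "a", "y"),
]
result = minimal_uniques(table, 3)
```

For this table no single column is unique, while the pairs of columns (0, 1) and (1, 2) are, so the full three-column combination is skipped as a superset of a known unique.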

“…Relative covers have been used previously by Saiedian and Spencer in [17] under the name contraction. We will be using them in the context of implication dependencies, which are functional dependencies over an attribute set of functional dependencies, describing implication between them (see Section 3.4).…”

Section: Relative Covers (mentioning)

confidence: 99%

Autonomous sets for the hypergraph of all canonical covers

Köhler 2011

Ann Math Artif Intell

We present a method for decomposing a hypergraph with certain regularities into smaller hypergraphs, in a "direct product"-like fashion. By applying this to the set of all canonical covers of a given set of functional dependencies, we obtain more efficient methods for solving several optimization problems in database design. These include finding one or all "optimal" covers w.r.t. different criteria, which can help to synthesize better decompositions, and to reduce the cost of constraint checking. As a central step we investigate how the hypergraph of all canonical covers can be computed efficiently. Our results suggest that decomposed representations of this hypergraph are usually small and can be obtained rather quickly, even if the number of canonical covers is huge.
