Unique column combinations of a relational database table are sets of columns that contain only unique values. Discovering such combinations is a fundamental research problem and has many different data management and knowledge discovery applications. Existing discovery algorithms are either brute force or have a high memory load and can thus be applied only to small datasets or samples. In this paper, the well-known Gordian algorithm [9] and "Apriori-based" algorithms [4] are compared and analyzed for further optimization. We greatly improve the Apriori algorithms through efficient candidate generation and statistics-based pruning methods. A hybrid solution, HCA-Gordian, combines the advantages of Gordian and our new algorithm HCA, and it outperforms all previous work in many situations.
* A full version of this paper is available at [1].

Unique discovery has high significance in several data management applications, such as data modeling, anomaly detection, query optimization, and indexing. Discovered uniques are good candidates for primary keys of a table, which is why some literature refers to them as "candidate keys" [8]. The term "composite key" is also used to highlight the fact that they comprise multiple columns [9]. We want to stress that the detection of uniques is a problem that can be solved exactly, while the detection of keys can only be solved heuristically. Uniqueness is a necessary precondition for a key, but only a human expert can "promote" a unique to a key, because uniques can appear by coincidence for a certain state of the data. In contrast, keys are consciously specified and denote a schema constraint.

An important property of uniques and keys is their minimality. Minimal uniques are uniques of which no strict subset holds the uniqueness property.

In principle, to identify a column combination K of fixed size as a unique, all tuples t_i must be scanned. A scan has a runtime of O(n) in the number n of rows. To detect duplicate values, one needs either a sort in O(n log n) or a hashing algorithm that needs O(n) space.

Non-uniques are defined as follows:

Definition 3. A column combination K that is not a unique is called a non-unique.

Discovering all uniques of a table or relational instance can be reduced to the problem of discovering all minimal uniques.
Every superset of a minimal unique is also a unique. Hence, in the rest of this paper, discovering all uniques is used synonymously with discovering all minimal uniques. The exponential complexity is caused by the fact that for a relational schema R = {C_1, ..., C_m}, there are 2^m − 1 subsets K ⊆ R...
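The hash-based uniqueness check and the size of the candidate space can be illustrated with a minimal Python sketch. It is a naive brute-force baseline, not Gordian or HCA. The function names `is_unique` and `minimal_uniques`, the representation of a table as a list of tuples, and the sample data are all assumptions introduced here for illustration.

```python
from itertools import combinations


def is_unique(rows, cols):
    """Check one column combination with a single scan and a hash set.

    Uses O(n) time and O(n) extra space in the number n of rows.
    """
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)  # projection of the row onto cols
        if key in seen:
            return False  # duplicate projection, so cols is a non-unique
        seen.add(key)
    return True


def minimal_uniques(rows, num_cols):
    """Enumerate column combinations by increasing size (hypothetical baseline).

    In the worst case all 2^m - 1 non-empty combinations are visited.
    """
    found = []
    for size in range(1, num_cols + 1):
        for cols in combinations(range(num_cols), size):
            # A superset of a known unique is unique but not minimal.
            if any(set(u) <= set(cols) for u in found):
                continue
            if is_unique(rows, cols):
                found.append(cols)
    return found


# Illustrative table with columns (first name, last name, age).
table = [
    ("Ann", "Lee", 30),
    ("Ann", "Kim", 30),
    ("Bob", "Lee", 25),
    ("Bob", "Kim", 30),
]

print(minimal_uniques(table, 3))
```

Because combinations are visited in order of increasing size, any unique that survives the superset test has no unique strict subset and is therefore minimal. Every surviving candidate still costs a full scan of the table, which is why this approach is only practical for small inputs.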