Discovering all most specific sentences

Gunopulos, Dimitrios; Khardon, Roni; Mannila, Heikki; Saluja, Sanjeev; Toivonen, Hannu; Sharma, Ram Sewak

doi:10.1145/777943.777945

Cited by 183 publications

(138 citation statements)

References 32 publications


“…To discover all minimal uniques and maximal non-uniques of a relational instance, in the worst case, one has to visit all subsets of the given relation, no matter the strategy (breadth-first or depth-first) or direction (bottom-up or top-down). Thus, the discovery of all minimal uniques and maximal non-uniques of a relational instance is an NP-hard problem and even the solution set can be exponential [64].…”

Section: Unique Column Combinations and Keys (mentioning)

confidence: 99%

Profiling relational data: a survey

Abedjan

2015


Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
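The citation statement above notes that finding all minimal uniques may, in the worst case, require visiting every column subset. A minimal sketch of a bottom-up (levelwise) search makes that cost visible. It is an illustration only, not the algorithm of the cited paper or of the survey; the function name `minimal_uniques` and the toy relation are assumptions introduced here.

```python
from itertools import combinations


def minimal_uniques(rows, n_cols):
    """Return all minimal unique column combinations of a relation.

    `rows` is a list of equal-length tuples; columns are referred to by index.
    Hypothetical helper for illustration: brute-force, exponential in n_cols.
    """
    found = []  # minimal uniques discovered so far, smallest first
    for size in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            # A superset of a known unique is unique but not minimal: prune it.
            if any(set(u) <= set(cols) for u in found):
                continue
            projected = [tuple(row[c] for c in cols) for row in rows]
            # The combination is unique if no projected row repeats.
            if len(set(projected)) == len(projected):
                found.append(cols)
    return found


# Toy relation (assumed data): no single column is unique, every pair is.
rows = [
    (1, "a", "x"),
    (1, "b", "y"),
    (2, "a", "y"),
]
print(minimal_uniques(rows, 3))  # [(0, 1), (0, 2), (1, 2)]
```

Pruning supersets of known uniques helps in practice, but the number of candidates (and of minimal uniques themselves) can still grow exponentially with the number of columns, which is the point the statement makes.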


“…Enumeration of the MEPSs was also studied in [10] as an instance of a more general problem [9] which includes frequent maximal itemset mining. The proposed enumeration algorithm in [10] is not fast though it also let us know the VC dimension of the intersection closure of the MEPS set.…”

Section: Related Work (mentioning)

confidence: 99%

“…One is the group of algorithms that directly find MFIs: LCMmax [1], Mafia [15] and GenMax [16]. The other is the group of algorithms that solves the dual problem, i.e., the problem of enumerating minimal infrequent itemsets: All MSS [9] and IBE [17]. In MFI-mining, the algorithms in the former group are faster than the algorithms in the latter group according to the contest results [17,18,1].…”

Section: Converting Algorithms From MFI Mining (mentioning)

confidence: 99%

“…As for enumeration approach, fortunately, enumeration of the MEPSs is known as an instance of a more general problem [9,10] which includes enumeration of the maximal frequent itemsets (MFIs), for which efficient algorithms have been actively developed recently. We make clear the relation between enumeration of the MEPSs and that of the MFIs, and show how to convert an algorithm for the MFIs to that for the MEPSs.…”

Section: Introduction (mentioning)

confidence: 99%

“…Appendix A. IBE.R IBE (Irredundant Border Enumerator [17]) is an improved algorithm of All MSS [9]. These algorithms obtain a new member of a target maximal-set family by finding and expanding a seed set, which is a subset of an unknown member.…”

mentioning

confidence: 99%
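The last statement describes obtaining a new maximal set by "finding and expanding a seed set". A minimal sketch of the expansion step alone, in the frequent-itemset setting, might look like the following. It is not the All MSS or IBE implementation; `support`, `expand_seed`, and the toy transactions are hypothetical names and data chosen for illustration, and the seed is assumed to be frequent already.

```python
def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def expand_seed(seed, items, transactions, min_support):
    """Greedily grow a frequent seed into a maximal frequent itemset.

    One pass suffices: support never increases when items are added, so an
    item rejected for a subset would also be rejected for the final set.
    """
    current = frozenset(seed)
    for item in sorted(items):
        candidate = current | {item}
        if support(candidate, transactions) >= min_support:
            current = candidate
    return current


# Toy data (assumed): four transactions over the items a, b, c.
transactions = [
    frozenset("abc"),
    frozenset("ab"),
    frozenset("ac"),
    frozenset("bc"),
]
items = {"a", "b", "c"}

print(sorted(expand_seed({"a"}, items, transactions, min_support=2)))  # ['a', 'b']
print(sorted(expand_seed({"c"}, items, transactions, min_support=2)))  # ['a', 'c']
```

Different seeds (or different item orders) lead to different maximal sets; the harder part of such algorithms, finding a seed that is not contained in any maximal set already found, is not shown here.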


An efficient construction and application usefulness of rectangle greedy covers

Ōuchi

Nakamura

Kudo

2014

Pattern Recognition


We develop efficient construction methods for a rectangle greedy cover (RGC), and evaluate its usefulness in applications. An RGC is a greedy cover of the set of given positive instances by exclusive axis-parallel hyperrectangles, namely, axis-parallel hyperrectangles that exclude all the given negative instances. An RGC is expected to be a compact classification rule with high readability because the number of its component rectangles is expected to be small and it can be seen as a disjunctive normal form, which is one of the most readable representations for us. We propose two approaches to RGC construction: the enumeration approach and the direct approach. In the enumeration approach, the maximal exclusive positive subsets (MEPSs) are enumerated first and then an ordinary greedy set covering is done using the enumerated MEPSs. We make clear the relation between enumeration of the maximal frequent itemsets and enumeration of the MEPSs, and convert an efficient enumeration algorithm for maximal frequent itemsets, LCMmax [1], into an enumeration algorithm for MEPSs, LCMmax.R naive. We also develop a more efficient version of LCMmax.R naive, called LCMmax.R, by incorporating effective dynamic reordering of instances using excluded frequency and a bit-parallel exclusiveness check. In the direct approach, each component MEPS of an RGC is searched for not among enumerated MEPSs but directly in the dataset that consists of the remaining uncovered positive instances and all the negative instances. We developed an algorithm called MRF that efficiently finds a maximum-sized MEPS for given positive and negative instances. MRF is made from LCMmax.R by modifying it so as to find a maximum-sized MEPS only.
An RGC is constructed by MRF repetition, that is, by repeatedly executing MRF using the remaining uncovered positive instances. According to our experimental evaluation using UCI-repository datasets, LCMmax.R was about 5-11 times faster than LCMmax.R naive, which indicates the effectiveness of the two improvements introduced. MRF repetition, however, was significantly faster than LCMmax.R, and it was fast enough for practical use on small datasets. The experimental results using UCI-repository datasets also showed that the accuracy of a nearest rectangle classifier using an RGC is close to that using the hyperrectangles output by the randomized subclass method (RSM) [2], though the number of component rectangles of an RGC is significantly smaller than the number of hyperrectangles output by RSM. The performance of RGC was also shown to be comparable to that of six popular classifiers, including logistic regression and support vector machine. The disjunctive normal form representation of the classification rules obtained by RGC was demonstrated to be simpler and more readable for us than that obtained by RSM and C4.5.


Discovering and Exploiting Statistical Properties for Query Optimization in Relational Databases: A Survey

Haas

Ilyas

Lohman

et al. 2009

Statistical Analysis


Discovering and exploiting statistical features in relational datasets is key to query optimization in a relational database management system (RDBMS), and is also needed for database design, cleaning, and integration. This paper surveys a variety of methods for automatically discovering important statistical features such as correlations, functional dependencies, keys, and algebraic constraints. We discuss proactive approaches in which the data is scanned or sampled (periodically, at optimization time or at query time), or in which exploratory queries are executed. Also discussed are reactive approaches that monitor the results of the query processing. Finally, we discuss methods for dealing with the practical challenges of maintaining statistical information in the face of heavy system utilization, and of dealing with inconsistencies that arise from incomplete cardinality models, use of multiple discovery methods, or changes in the underlying data over time.
