Approximating the number of unique values of an attribute without sorting

Astrahan, M. M.; Schkolnick, Mario; Whang, Kyu-Young

doi:10.1016/0306-4379(87)90014-7

Cited by 33 publications

(31 citation statements)

References 1 publication

“…Haas et al [65] base their estimation on data samples and describe and empirically compare various estimators from the literature. Other works do scan the entire data but use only a small amount of memory to hash the values and estimate the number of distinct values, an early example being [11].…”

Section: Cardinalities (mentioning)

confidence: 99%

Profiling relational data: a survey

Abedjan

2015

Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.

“…Each leaf node (as well as each internal node) of the complex structure is uniquely identified by an XPath expression; the set of XPaths (i.e., nodes) for a given data source can be computed simply by submitting every XML document from the data source to a generic SAX parser. There is a data set associated with each leaf node, and for the ith leaf node the data set is ds_i = {v_1, v_2, …, v_n}, where v_j is the data value at the ith leaf node in the jth XML document from the specified data source. Observe that ds_i is a multiset, because the same data value can appear in multiple documents.…”

Section: Repository Dataflow (mentioning)

confidence: 99%

“…The second key component of the synopsis is a hash-counting data structure. The idea is that each incoming data element is fed into a "probabilistic counting" algorithm; see, e.g., [2]. Such algorithms estimate the number of distinct values in a data set in a single pass using a fixed amount of memory.…”

Section: Synopsis Creation (mentioning)

confidence: 99%
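
The single-pass, fixed-memory idea in the statement above can be illustrated with a minimal sketch. It assumes linear counting (a hashed bitmap of fixed size) as one representative of the probabilistic counting family; the quoted text does not say which variant the citing system uses. The names `linear_count` and `_bucket`, the bitmap size `m`, and the choice of SHA-256 as the hash are illustrative assumptions, not taken from the source.

```python
import hashlib
import math


def _bucket(value, m):
    """Map a value to one of m bitmap positions (hypothetical helper)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m


def linear_count(values, m=4096):
    """Estimate the number of distinct values in one pass with m flags.

    Assumption: linear counting, chosen here only for illustration.
    Duplicates hash to the same position, so they do not change the bitmap.
    """
    bitmap = bytearray(m)          # fixed memory, independent of input size
    for v in values:               # single pass, no sorting
        bitmap[_bucket(v, m)] = 1
    zeros = m - sum(bitmap)
    if zeros == 0:
        # Bitmap is saturated: m is too small for this data set.
        raise ValueError("bitmap saturated; increase m")
    # If n distinct values are hashed uniformly, the expected fraction of
    # empty positions is about exp(-n / m); solve for n.
    return m * math.log(m / zeros)


if __name__ == "__main__":
    data = (i % 1000 for i in range(50_000))   # 1000 distinct values
    print(round(linear_count(data)))
```

The estimate degrades as the bitmap fills up, so `m` would need to be sized relative to the largest cardinality expected.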

“…One such measure is the Jaccard metric, which is a normalized measure of the intersection of the domains of two data sets. The idea is to use the probabilistic counting algorithm associated with the hash-count signature to estimate C1 = Card[Domain(ds_1)] and C2 = Card[Domain(ds_2)]. Then the two hash-count structures are merged into a combined hash-count structure and the probabilistic counting algorithm is then applied to obtain an estimate of C3 = Card[Domain(ds_1) ∪ Domain(ds_2)].…”

Section: Similarity For Hash-count Signatures (mentioning)

confidence: 99%
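
The C1, C2, C3 procedure described in the statement above can be sketched as follows. The quoted passage is truncated before it gives the final formula, so this sketch assumes the standard inclusion-exclusion step: the intersection size is C1 + C2 - C3, and the Jaccard value is that divided by C3. It also assumes a bitmap signature with linear counting as the estimator and a bitwise OR as the merge; the names `signature`, `estimate`, `merge`, and `jaccard` are hypothetical.

```python
import hashlib
import math

M = 8192  # signature size in flags (illustrative assumption)


def signature(values, m=M):
    """Build a fixed-size hash-count signature (a bitmap) in one pass."""
    bits = bytearray(m)
    for v in values:
        digest = hashlib.sha256(str(v).encode("utf-8")).digest()
        bits[int.from_bytes(digest[:8], "big") % m] = 1
    return bits


def estimate(bits):
    """Linear-counting estimate of the number of distinct hashed values."""
    m = len(bits)
    zeros = m - sum(bits)
    if zeros == 0:
        raise ValueError("signature saturated; increase M")
    return m * math.log(m / zeros)


def merge(a, b):
    """Combine two signatures; the result is the signature of the union."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same size")
    return bytearray(x | y for x, y in zip(a, b))


def jaccard(a, b):
    """Estimate |A ∩ B| / |A ∪ B| from two signatures."""
    c1, c2 = estimate(a), estimate(b)
    c3 = estimate(merge(a, b))
    if c3 == 0:
        return 0.0                      # both domains are empty
    # Inclusion-exclusion: |A ∩ B| = C1 + C2 - C3. Clamp to [0, 1]
    # because the three estimates carry independent rounding noise.
    return max(0.0, min(1.0, (c1 + c2 - c3) / c3))


if __name__ == "__main__":
    ds1 = signature(range(0, 2000))
    ds2 = signature(range(1000, 3000))   # true Jaccard is 1000 / 3000
    print(round(jaccard(ds1, ds2), 2))
```

Merging by OR works only when both signatures use the same size and the same hash function, which is why `merge` rejects mismatched lengths.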

“…On the other hand, two patterns may define the same domain in a different way, resulting in an incorrect assessment of their similarity. For instance, patterns " [1][2][3]" and "1 | 2 | 3" look very different but define the same domain. A value-based similarity measure is slightly more expensive but more reliable.…”

Section: Similarity For Hash-count Signatures (mentioning)

confidence: 99%

Toward Automated Large-Scale Information Integration and Discovery

Brown, Haas, Myllymaki, et al.

2005

Lecture Notes in Computer Science

Abstract. The high cost of data consolidation is the key market inhibitor to the adoption of traditional information integration and data warehousing solutions. In this paper, we outline a next-generation integrated database management system that takes traditional information integration, content management, and data warehouse techniques to the next level: the system will be able to integrate a very large number of information sources and automatically construct a global business view in terms of "Universal Business Objects". We describe techniques for discovering, unifying, and aggregating data from a large number of disparate data sources. Enabling technologies for our solution are XML, web services, caching, messaging, and portals for real-time dashboarding and reporting.

An Introduction to Data Profiling

Abedjan

2018

Lecture Notes in Business Information Processing
