Attribute classification using feature analysis

Nauman, Farrukh; Ho, Ching-Tien; Tian, Xuqing; Haas, Laura M.; Megiddo, Nimrod

doi:10.1109/icde.2002.994725

Cited by 22 publications

(14 citation statements)

References 1 publication


“…In their scheme, the data is effectively treated as categorical. Other related works in this area include the work of He, Chang and Han [11] on schema matching for the deep web and work on using distributional signatures for value mapping by Kang et al [12] and Naumann et al [13]. For related work in the AI community, we refer the reader to the survey by Doan and Halevy [2].…”

Section: G Our Contributions (mentioning)

confidence: 99%

Validating Multi-column Schema Matchings by Type

Dai, Koudas, Srivastava, et al. 2008

2008 IEEE 24th International Conference on Data Engineering


Abstract: Validation of multi-column schema matchings is essential for successful database integration. This task is especially difficult when the databases to be integrated contain little overlapping data, as is often the case in practice (e.g., customer bases of different companies). Based on the intuition that values present in different columns related by a schema matching will have similar "semantic type", and that this can be captured using distributions over values ("statistical types"), we develop a method for validating 1:1 and compositional schema matchings. Our technique is based on three key technical ideas. First, we propose a generic measure for comparing two columns matched by a schema matching, based on a notion of information-theoretic discrepancy that generalizes the standard geometric discrepancy; this provides the basis for 1:1 matching. Second, we present an algorithm for "splitting" the string values in a column to identify substrings that are likely to match with the values in another column; this enables (multi-column) 1:m schema matching. Third, our technique provides an invalidation certificate if it fails to validate a schema matching. We complement our conceptual and algorithmic contributions with an experimental study that demonstrates the effectiveness and efficiency of our technique on a variety of database schemas and data sets.
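The idea of comparing two columns by their distributions over values, even when the columns share no actual values, can be illustrated with a small sketch. The code below is not the discrepancy measure from the paper above; it swaps in the Jensen-Shannon divergence over character-class patterns as a simple stand-in, and the function names `value_pattern`, `pattern_distribution`, and `js_divergence` are hypothetical.

```python
import math
from collections import Counter


def value_pattern(value):
    """Abstract a string to a pattern: digits -> '9', letters -> 'A', rest kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def pattern_distribution(column):
    """Relative frequency of each pattern in a non-empty column of strings."""
    if not column:
        raise ValueError("column must contain at least one value")
    counts = Counter(value_pattern(v) for v in column)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mix(dist):
        return sum(w * math.log2(w / mix[k]) for k, w in dist.items() if w > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


# Two phone columns with no overlapping values, and one name column.
phones_a = pattern_distribution(["555-1234", "555-9876"])
phones_b = pattern_distribution(["444-0000", "123-4567"])
names = pattern_distribution(["Alice", "Bob"])

same_type = js_divergence(phones_a, phones_b)
different_type = js_divergence(phones_a, names)
```

Under these assumptions the two phone columns get a divergence of 0 despite sharing no values, while the phone and name columns get the maximum of 1. A real validator would need a richer notion of statistical type than a single pattern abstraction.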


“…Previous generic tools include Cupid, SimilarityFlooding, and Clio [34][35][36]. As discussed, COMA++ is also generic and supports several schema languages, including XSD, OWL, and relational schemas, and it is able to deal with complex distributed XML schemas.…”

Section: Previous Solutions Vs Coma++ (mentioning)

confidence: 99%

Matching large schemas: Approaches and evaluation

Rahm

2007

Information Systems



Current schema matching approaches still have to improve for large and complex schemas. The large search space increases the likelihood for false matches as well as execution times. Further difficulties for schema matching are posed by the high expressive power and versatility of modern schema languages, in particular user-defined types and classes, component reuse capabilities, and support for distributed schemas and namespaces. To better assist the user in matching complex schemas, we have developed a new generic schema matching tool, COMA++, providing a library of individual matchers and a flexible infrastructure to combine the matchers and refine their results. Different match strategies can be applied including a new scalable approach to identify context-dependent correspondences between schemas with shared elements and a fragment-based match approach which decomposes a large match task into smaller tasks. We conducted a comprehensive evaluation of the match strategies using large e-Business standard schemas. Besides providing helpful insights for future match implementations, the evaluation demonstrated the practicability of our system for matching large schemas.


“…Many schema matching systems perform data profiling to create attribute features, such as data type, average value length, and patterns, to compare feature vectors and align those attributes with the best matching ones [98,109].…”

Section: Use-cases For Data Profiling (mentioning)

confidence: 99%

Profiling relational data: a survey

Abedjan et al.

2015


Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
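The simpler single-column statistics listed in this abstract (null count, distinct count, data type, most frequent pattern) can be computed in a few lines. The following is a minimal sketch, not taken from the survey or any profiling tool: the names `to_pattern`, `infer_type`, and `profile_column` are hypothetical, nulls are assumed to be represented as `None`, and all other values are assumed to be strings.

```python
from collections import Counter


def to_pattern(value):
    """Abstract a string to a pattern: digits -> '9', letters -> 'A', rest kept."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch for ch in value
    )


def infer_type(values):
    """Guess the narrowest type that every value parses as."""
    if not values:
        return "unknown"
    for name, cast in (("integer", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def profile_column(values):
    """Collect basic single-column metadata; None marks a null value."""
    non_null = [v for v in values if v is not None]
    patterns = Counter(to_pattern(v) for v in non_null)
    avg_length = sum(len(v) for v in non_null) / len(non_null) if non_null else 0.0
    return {
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "data_type": infer_type(non_null),
        "avg_length": avg_length,
        "top_pattern": patterns.most_common(1)[0][0] if patterns else None,
    }


# Hypothetical postal-code column with one null and one repeated value.
profile = profile_column(["10115", "80331", None, "10115"])
```

A schema matcher of the kind described in the citing statement could compare such dictionaries across columns as feature vectors. Multi-column metadata such as functional or inclusion dependencies needs considerably more machinery and is out of scope for this sketch.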
