XML has evolved into the format of choice for exposing data over the web. Together with mature and maturing standards for querying XML (XSLT, XPath, and XQuery), the basic infrastructure for integrating multiple heterogeneous data sources is in place. However, the versatility of XML as a data model and the unrestricted expressive power of XML query languages can lead to rather complex integration architectures, in which low-level syntactic heterogeneities and semantic heterogeneities are overcome all at once by means of complex query expressions. This paper explores how the Web Ontology Language (OWL) can be used as a more abstract modelling layer on top of XML data sources described by an XML Schema, to what extent the semantic relationships provided by OWL can be used for mapping heterogeneous data sources to a common global schema, and how the inference mechanisms of OWL can be used to check the consistency of such mappings. Moreover, it introduces a query language for OWL as a natural extension of XQuery and describes how queries against a global schema are translated to XQueries against the original data sources.
The problem of identifying objects in databases that refer to the same real-world entity is known, among other names, as duplicate detection or record linkage. Objects may be duplicates even though they are not identical, owing to errors and missing data.

Traditional scenarios for duplicate detection are data warehouses, which are populated from several data sources. Here, duplicate detection is part of the data cleansing process that improves data quality for the data warehouse. More recently, the problem has also arisen in application scenarios such as web portals, which offer users unified access to several data sources, and meta search engines, which distribute a search to several other resources and then merge the individual results. In such scenarios no long and expensive data cleansing process can be carried out; good duplicate estimates must be available immediately.

The most common approaches to duplicate detection use either rules or a weighted aggregation of similarity measures between the individual attributes of potential duplicates. However, choosing appropriate rules, similarity functions, weights, and thresholds requires a deep understanding of the application domain or, for supervised learning approaches, a good representative training set. For this reason, these approaches entail significant costs.

This thesis presents an unsupervised, domain-independent approach to duplicate detection that starts with a broad alignment of potential duplicates and analyses the distribution of observed similarity values among these potential duplicates and among representative sample non-duplicates to improve the initial alignment. To this end, a refinement of the classic Fellegi-Sunter model for record linkage is developed, which uses these distributions to iteratively remove clear non-duplicates from the set of potential duplicates.
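As a point of reference, the classic Fellegi-Sunter scoring that the thesis refines can be sketched roughly as follows. This is a minimal illustration of the textbook model only, not the refined, distribution-based variant described above. The attribute names, the m and u probabilities, the two thresholds, and all function names are hypothetical values chosen for the example.

```python
import math

# Hypothetical per-attribute probabilities (illustrative, not from the thesis):
#   m = P(attribute agrees | the two records are duplicates)
#   u = P(attribute agrees | the two records are not duplicates)
M_U = {
    "name": (0.95, 0.01),
    "city": (0.90, 0.10),
    "year": (0.85, 0.05),
}


def normalise(value):
    """Crude normalisation so that trivial formatting differences do not count."""
    return str(value).strip().lower()


def agreement_vector(a, b):
    """Compare two records attribute by attribute; missing values count as disagreement."""
    vector = {}
    for attr in M_U:
        x = normalise(a.get(attr, ""))
        y = normalise(b.get(attr, ""))
        vector[attr] = bool(x) and x == y
    return vector


def match_weight(agreement):
    """Sum the log-likelihood ratios of the observed agreements and disagreements."""
    total = 0.0
    for attr, agrees in agreement.items():
        m, u = M_U[attr]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, lower=0.0, upper=6.0):
    """Two-threshold decision rule; the thresholds here are assumed, not derived."""
    if weight >= upper:
        return "duplicate"
    if weight <= lower:
        return "non-duplicate"
    return "possible duplicate"


if __name__ == "__main__":
    r1 = {"name": "Ann Lee", "city": "Bonn", "year": "2004"}
    r2 = {"name": " ann lee", "city": "bonn", "year": "2004"}
    print(classify(match_weight(agreement_vector(r1, r2))))
```

In the classic model the m and u probabilities and the thresholds have to be estimated or chosen by hand, which is the cost that an unsupervised approach aims to avoid.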
Alternatively, machine learning methods such as Support Vector Machines are also used and compared with the refined Fellegi-Sunter model.

Additionally, the presented approach is not only able to align flat records but also makes use of related objects, which may significantly increase the alignment accuracy, depending on the application.

Evaluations show that the approach outperforms other unsupervised approaches and reaches almost the same accuracy as even fully supervised, domain-dependent approaches.

German summary

The problem of recognizing that different database entries refer to the same real-world object is known in the literature as "duplicate detection" (Duplikaterkennung) or "record linkage".