Declarative Data Fusion – Syntax, Semantics, and Implementation

Bleiholder, Jens; Naumann, Felix

doi:10.1007/11547686_5

Cited by 32 publications

(21 citation statements)

References 9 publications

“…The Silk Link Discovery Framework can be downloaded from its official homepage, which is also the source for the documentation, examples and updates. It is an open source tool with the source code and the detailed developer documentation available online.…”

Section: Silk: Functionality and Main Concepts (mentioning)

confidence: 99%

Interlinking and Knowledge Fusion

Bryl

Bizer

Isele

et al. 2014

Linked Open Data -- Creating Knowledge Out of Interlinked Data

“…The Sieve Data Fusion module is inspired by [4], a framework for data fusion in the context of relational databases that includes three major categories of conflict handling strategies:…”

Section: Fusion Functions (mentioning)

confidence: 99%

Interlinking and Knowledge Fusion

Bryl

Bizer

Isele

et al. 2014

Linked Open Data -- Creating Knowledge Out of Interlinked Data

“…These functions are applied for individual attributes under certain conditions in a specified order. Recent work proposes a declarative specification of conflict resolution strategies [Naumann and Häussler 2002; Bleiholder and Naumann 2005]. Our work is orthogonal to this body of work.…”

Section: Conflict Resolution (mentioning)

confidence: 99%

Improving data quality by source analysis

Müller

Freytag

Leser

2012

J. Data and Information Quality

In many domains, data cleaning is hampered by our limited ability to specify a comprehensive set of integrity constraints to assist in identification of erroneous data. An alternative approach to improve data quality is to exploit different data sources that contain information about the same set of objects. Such overlapping sources highlight hot-spots of poor data quality through conflicting data values and immediately provide alternative values for conflict resolution. In order to derive a dataset of high quality, we can merge the overlapping sources based on a quality assessment of the conflicting values. The quality of the resulting dataset, however, is highly dependent on our ability to assess the quality of conflicting values effectively.

The main objective of this article is to introduce methods that aid the developer of an integrated system over overlapping, but contradicting sources in the task of improving the quality of data. Value conflicts between contradicting sources are often systematic, caused by some characteristic of the different sources. Our goal is to identify such systematic differences and outline data patterns that occur in conjunction with them. Evaluated by an expert user, the regularities discovered provide insights into possible conflict reasons and help to assess the quality of inconsistent values.

The contributions of this article are two concepts of systematic conflicts: contradiction patterns and minimal update sequences. Contradiction patterns resemble a special form of association rules that summarize characteristic data properties for conflict occurrence. We adapt existing association rule mining algorithms for mining contradiction patterns. Contradiction patterns, however, view each class of conflicts in isolation, sometimes leading to largely overlapping patterns. Sequences of set-oriented update operations that transform one data source into the other are compact descriptions for all regular differences among the sources. We consider minimal update sequences as the most likely explanation for observed differences between overlapping data sources. Furthermore, the order of operations within the sequences points out potential dependencies between systematic differences. Finding minimal update sequences, however, is beyond reach in practice. We show that the problem already is NP-complete for a restricted set of operations. In the light of this intractability result, we present heuristics that lead to convincing results for all examples we considered.

“…This is a data integration scenario, in which movies from two sources are first mapped to a common schema, and then de-duplicated. The sources are the Internet Movie Database IMDB and the German Movie Repository FILMDIENST. Fig.…”

Section: Use Cases (mentioning)

confidence: 99%

“…Fusion has received less attention, and all work focuses on relational data. The authors of [6] propose an operator that extends SQL to support declarative fusion and implemented in the HumMer system [5], and we plan to develop a similar technique for XML data. Other solutions include TSIMMIS [18] relying on source preference in the context of data integration, and ConQuer [13] that filters inconsistencies out of query results.…”

Section: Related Work (mentioning)

confidence: 99%

Declarative XML Data Cleaning with XClean

Weis

Manolescu

2007

Advanced Information Systems Engineering

Abstract. Data cleaning is the process of correcting anomalies in a data source, that may for instance be due to typographical errors, or duplicate representations of an entity. It is a crucial task in customer relationship management, data mining, and data integration. With the growing amount of XML data, approaches to effectively and efficiently clean XML are needed, an issue not addressed by existing data cleaning systems that mostly specialize on relational data. We present XClean, a data cleaning framework specifically geared towards cleaning XML data. XClean's approach is based on a set of cleaning operators, whose semantics is well-defined in terms of XML algebraic operators. Users may specify cleaning programs by combining operators by means of a declarative XClean/PL program, which is then compiled into XQuery. We describe XClean's operators, language, and compilation approach, and validate its effectiveness through a series of case studies.

Motivation. Data cleaning is the process of correcting anomalies in a data source, that may for instance be due to typographical errors, formatting differences, or duplicate representations of an entity. It is a crucial task in customer relationship management, data mining, and data integration. Relational data cleaning is performed in specialized frameworks [14,21,26], or by specialized modules in modern relational database management systems [8].

With the growing popularity of XML and the large volumes of XML data becoming available, approaches to effectively and efficiently clean XML data are needed. For example, consider DBLP, whose data is available in XML format. Fig. 1 shows an excerpt of the DBLP entry of one of this paper's authors, on which we observe several XML data cleaning issues. First, the SIGMOD conference is represented by the conference abbreviation, the string "Conference", and the year of the conference, whereas VLDB is only represented by its abbreviation and year. Clearly, both conferences are represented differently, which can be corrected through data cleaning. A second example is the representation of author names. In the bottom publication, the first author is represented by its firstname and lastname, whereas the second author's firstname is…