2010
DOI: 10.1007/978-3-642-14010-5_8

An Overview of XML Duplicate Detection Algorithms

Abstract: Fuzzy duplicate detection aims at identifying multiple representations of real-world objects in a data source, and is a task of critical relevance in data cleaning, data mining, and data integration. It has a long history for relational data, stored in a single table or in multiple tables with an equal schema. However, algorithms for fuzzy duplicate detection in more complex structures, such as hierarchies of a data warehouse, XML data, or graph data, have only recently emerged. These algorithms…

Cited by 12 publications (13 citation statements)
References 38 publications
“…This algorithm considers both the similarity of attribute contents and the relative importance of descendant elements, with respect to the overall similarity score. The algorithm presented here extends our previous work [6], [9] by (i) significantly improving efficiency and (ii) showing a more extensive set of experiments. Our contributions, especially compared to our previous work, can be summarized as follows: (i) we address the issue of efficiency of our initial solution [6] by introducing a novel pruning algorithm and studying how the order in which nodes are processed affects runtime.…”
Section: Introduction (supporting)
confidence: 57%
“…This process can be manually tuned or performed automatically, using known duplicate objects from other databases; and (iii) we provide a more extensive evaluation of our algorithms than in our previous work. More specifically, we demonstrate the effectiveness of our algorithm on a larger number of data sets, from different domains than those used in [6], [9]. Also, we extensively evaluate efficiency.…”
Section: Introduction (mentioning)
confidence: 97%
“…These works differ from previous approaches since they were specifically designed to exploit the distinctive characteristics of XML object representations: their structure, textual content, and the semantics implicit in the XML labels. We briefly describe the main features of these methods here, and refer readers to [9] for a detailed theoretical and experimental comparison of these approaches.…”
Section: III (mentioning)
confidence: 99%
“…An example of an interesting survey concerning strings in general is [10]. Work focused on coreference (duplicates) detection in the context of XML is [11]. An example of an approach employing fuzzy logic which might also be of interest to the reader is [12].…”
Section: ) Steps Comparison (mentioning)
confidence: 99%