An Overview of XML Duplicate Detection Algorithms

Calado, Pável; Herschel, Melanie; Leitão, Luís

doi:10.1007/978-3-642-14010-5_8

Cited by 12 publications

(13 citation statements)

References 38 publications


“…This algorithm considers both the similarity of attribute contents and the relative importance of descendant elements, with respect to the overall similarity score. The algorithm presented here extends our previous work [6], [9] by (i) significantly improving efficiency and (ii) showing a more extensive set of experiments. Our contributions, especially compared to our previous work, can be summarized as follows: (i) we address the issue of efficiency of our initial solution [6] by introducing a novel pruning algorithm and studying how the order in which nodes are processed affects runtime.…”

Section: Introduction (supporting)

confidence: 57%

“…This process can be manually tuned or performed automatically, using known duplicate objects from other databases; and (iii) we provide a more extensive evaluation of our algorithms than in our previous work. More specifically, we demonstrate the effectiveness of our algorithm on a larger number of data sets, from different domains than those used in [6], [9]. Also, we extensively evaluate efficiency.…”

Section: Introduction (mentioning)

confidence: 97%

“…Our contributions, especially compared to our previous work, can be summarized as follows: (i) we address the issue of efficiency of our initial solution [6] by introducing a novel pruning algorithm and studying how the order in which nodes are processed affects runtime. A major result is that XMLDup now outperforms DogmatiX [5], a previously more efficient state of the art algorithm for XML duplicate detection [9]; (ii) we describe how to increase efficiency when a slight drop in recall, i.e., in the number of identified duplicates, is acceptable. This process can be manually tuned or performed automatically, using known duplicate objects from other databases; and (iii) we provide a more extensive evaluation of our algorithms than in our previous work.…”

Section: Introduction (mentioning)

confidence: 99%


Efficient and Effective Duplicate Detection in Hierarchical Data

Leitão, Calado, Herschel

2013

IEEE Trans. Knowl. Data Eng.



“…These works differ from previous approaches since they were specifically designed to exploit the distinctive characteristics of XML object representations: their structure, textual content, and the semantics implicit in the XML labels. We briefly describe the main features of these methods here, and refer readers to [9] for a detailed theoretical and experimental comparison of these approaches.…”

Section: III (mentioning)

confidence: 99%

EDDDS: An Efficient Duplicate Data Detection System

Dhake, S.S., Y.R., et al.

2015

International Journal of Advanced Research in Computer and Comm


Duplicate detection is a critical task for any organization's database. Duplicates are representations of the same real-world entity or object that appear in different structures and formats. They can occur in relational data, in complex data, and in hierarchical data such as XML. Much prior work addresses duplicate detection in relational data, but attention has recently shifted to XML data, because XML is very popular for data storage and is used extensively for data exchange between organizations. Here we present an extensive literature survey on this topic and propose a duplicate detection method that combines ideas from existing papers with some original ideas of our own. In addition to improving efficiency and effectiveness, the method also checks for typographical errors when comparing two XML elements. To test the correctness of our method, we compare it with an existing duplicate detection system, focusing on how we obtain higher precision and recall on the various data sets we used.
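The abstract above does not say how typographical errors are handled. One common way to make an element comparison tolerant of typos is a normalized edit distance over the text of shared child elements. The sketch below is only an illustration under that assumption and is not the EDDDS method; the function names, the equal weighting of child elements, and the sample `<movie>` records are all hypothetical.

```python
import xml.etree.ElementTree as ET


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def text_similarity(a: str, b: str) -> float:
    """Map edit distance to a similarity in [0, 1]; 1.0 means identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def element_similarity(x: ET.Element, y: ET.Element) -> float:
    """Average text similarity over child tags present in both elements.

    Assumption: every shared child tag carries equal weight, and only the
    first child with a given tag is compared.
    """
    shared = sorted({c.tag for c in x} & {c.tag for c in y})
    if not shared:
        return 0.0
    total = 0.0
    for tag in shared:
        left = (x.findtext(tag) or "").strip().lower()
        right = (y.findtext(tag) or "").strip().lower()
        total += text_similarity(left, right)
    return total / len(shared)


# Hypothetical records: the second title contains a transposition typo.
a = ET.fromstring("<movie><title>The Matrix</title><year>1999</year></movie>")
b = ET.fromstring("<movie><title>The Matrxi</title><year>1999</year></movie>")
score = element_similarity(a, b)
```

A pair would then be flagged as a candidate duplicate when `score` exceeds a threshold chosen for the data set; that threshold is likewise an assumption here.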


“…An example of an interesting survey concerning strings in general is [10]. Work focused on coreference (duplicates) detection in the context of XML is [11]. An example of an approach employing fuzzy logic which might also be of interest to the reader is [12].…”

Section: ) Steps Comparison (mentioning)

confidence: 99%

Coreference detection in XML metadata

Szymczak

Zadrożny

Tré

2013

2013 Joint IFSA World Congress and NAFIPS Annual Meeting (IFSA/NAFIPS)


Abstract: Preserving data quality is an important issue in data collection management. One of the crucial issues here is the detection of duplicate objects (called coreferent objects), which describe the same entity but in different ways. In this paper we present a method for detecting coreferent objects in metadata, in particular in XML schemas. Our approach consists of comparing the paths from a root element to a given element in the schema. Each path precisely defines the context and location of a specific element in the schema. Path matching is based on comparing the individual steps of which the paths are composed. The uncertainty about the matching of steps is expressed with possibilistic truth values and aggregated using the Sugeno integral. The discovered coreference of paths can help determine the coreference of different XML schemas.
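To make the aggregation step more concrete, the following is a minimal sketch of a discrete Sugeno integral. It is a simplification and not the authors' implementation: each step match is assumed to be a single score in [0, 1] rather than a possibilistic truth value, and the fuzzy measure is assumed to depend only on the number of steps in a subset. The names `sugeno_integral` and `cardinality_measure` and the sample scores are hypothetical.

```python
from typing import Callable, FrozenSet, Sequence


def sugeno_integral(scores: Sequence[float],
                    measure: Callable[[FrozenSet[int]], float]) -> float:
    """Discrete Sugeno integral of `scores` with respect to `measure`.

    With the scores sorted in ascending order, the result is the maximum
    over i of min(score_i, measure(indices whose score is >= score_i)).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    best = 0.0
    for rank, idx in enumerate(order):
        upper = frozenset(order[rank:])  # steps scoring at least scores[idx]
        best = max(best, min(scores[idx], measure(upper)))
    return best


def cardinality_measure(n: int) -> Callable[[FrozenSet[int]], float]:
    """Assumed fuzzy measure: a subset's weight is its share of the n steps."""
    return lambda subset: len(subset) / n


# Hypothetical match scores for the three steps of a path pair.
step_scores = [0.9, 0.5, 1.0]
path_score = sugeno_integral(step_scores, cardinality_measure(len(step_scores)))
```

With this symmetric measure, the result is high only when a large share of the steps match well, so one strongly matching step cannot dominate the path score.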

