2009 IEEE 25th International Conference on Data Engineering 2009
DOI: 10.1109/icde.2009.43

Large-Scale Deduplication with Constraints Using Dedupalog

Abstract: We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is "each paper has a unique publication venue"; if two paper references are duplicates, then their associated conference references must be duplicates as well. Our framework supports collective deduplication, meaning that we can dedupe both paper references and confe…
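
The venue constraint in the abstract can be illustrated with a small sketch. This is not Dedupalog syntax and not the authors' algorithm. It is a minimal union-find illustration, with hypothetical names (`UnionFind`, `propagate_venue_constraint`, `venue_of`) and made-up sample references, of how merging two paper references forces their venue references to merge as well.

```python
class UnionFind:
    """Minimal disjoint-set structure over hashable keys."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen keys as their own root, then walk up with path halving.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def propagate_venue_constraint(paper_duplicates, venue_of):
    """Merge paper references and, for each merged pair, merge their venues.

    Hypothetical helper: encodes "each paper has a unique publication venue".
    """
    papers = UnionFind()
    venues = UnionFind()
    for p, q in paper_duplicates:
        papers.union(p, q)
        # Duplicate papers must share one venue, so their venue references merge.
        venues.union(venue_of[p], venue_of[q])
    return papers, venues


# Made-up sample data (assumption, not from the paper).
venue_of = {"p1": "ICDE", "p2": "Intl. Conf. on Data Engineering", "p3": "VLDB"}
papers, venues = propagate_venue_constraint([("p1", "p2")], venue_of)

same_venue = venues.find("ICDE") == venues.find("Intl. Conf. on Data Engineering")
different_venue = venues.find("ICDE") != venues.find("VLDB")
```

Here the single paper match between `p1` and `p2` is enough to place both venue strings in one cluster, while `VLDB` stays separate. This is the sense in which the deduplication is collective.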

citations
Cited by 137 publications
(155 citation statements)
references
References 41 publications
(68 reference statements)
“…Since only one threshold is learned in the previous works, we use the mean of as the threshold. We compare DWDEI with two recent related works [35] and [36]. From the experimental results shown in Table 6 and the performance comparison on F-measure shown in Figure 7, two conclusions can be made.…”
Section: E Performance Comparison With Previous Related Work Using
Citation type: mentioning
confidence: 86%
“…This prevents the result from degenerating to a single cluster (as such a cut uses no negative edges) or |V | clusters (as such a cut includes all positive edges). CC has found uses in problems involving unknown and possibly large number of clusters, such as entity deduplication [37], community detection in social networks [38], gene clustering [39], and image segmentation [10,[40][41][42][43].…”
Section: Correlation Clustering
Citation type: mentioning
confidence: 99%
“…non composite S-keys). Other approaches aim to enrich the ontology and/or use these S-keys to generate identity links between pairs of instances that can be propagated to other pairs of instances ( [19,1]). Such approaches, are called collective or global approaches of data linking.…”
Section: Existing Work
Citation type: mentioning
confidence: 99%