2019
DOI: 10.1145/3352591

Transforming Pairwise Duplicates to Entity Clusters for High-quality Duplicate Detection

Abstract: Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: Not all records within a cluster are sufficiently similar to be classified as duplicate. Thus, one of many subsequent clustering algorithms can further improve the result. We explain in detail, compare, and evaluate many of these algorithms and introduce three new cluste…
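
The inconsistency the abstract describes can be illustrated with a minimal sketch. It is not one of the paper's algorithms: it only builds transitive clusters from pairwise duplicate decisions (using a simple union-find) and then lists the within-cluster pairs that fall below the similarity threshold. The record names, similarity scores, threshold, and the function names `transitive_clusters` and `inconsistent_pairs` are all hypothetical and chosen for illustration.

```python
from itertools import combinations


def transitive_clusters(records, duplicate_pairs):
    """Group records into connected components of the duplicate-pair graph."""
    parent = {r: r for r in records}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), []).append(r)
    return sorted(sorted(c) for c in clusters.values())


def inconsistent_pairs(clusters, similarity, threshold):
    """Return within-cluster pairs whose similarity is below the threshold."""
    bad = []
    for cluster in clusters:
        for a, b in combinations(cluster, 2):
            # Pairs without a recorded score are treated as similarity 0.0.
            score = similarity.get((a, b), similarity.get((b, a), 0.0))
            if score < threshold:
                bad.append((a, b))
    return bad


# Hypothetical toy data: r1~r2 and r2~r3 are similar, but r1 and r3 are not.
records = ["r1", "r2", "r3", "r4"]
similarity = {("r1", "r2"): 0.9, ("r2", "r3"): 0.85, ("r1", "r3"): 0.3}
threshold = 0.8

pairs = [p for p, s in similarity.items() if s >= threshold]
clusters = transitive_clusters(records, pairs)
bad = inconsistent_pairs(clusters, similarity, threshold)

print(clusters)  # [['r1', 'r2', 'r3'], ['r4']]
print(bad)       # [('r1', 'r3')]
```

In this toy case the transitive cluster joins r1 and r3 through r2 even though their own similarity is below the threshold, which is the kind of pair a subsequent clustering step would need to resolve.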

citations
Cited by 16 publications
(14 citation statements)
references
References 32 publications
“…In many duplicate detection problems, this is a nontrivial issue, and open area of research. 6 In our case, we observe that about 98% of pairs within groups do have a similarity above the threshold.…”
Section: Methods
Citation type: mentioning
confidence: 45%
“…There are many ML-based solutions to data quality problems, e.g. entity resolution [25,16,1]. In this article we have shown how ML and AI bring to the table new dimensions of data quality, and how techniques from explainable AI can help in this direction.…”
Section: Final Remarks
Citation type: mentioning
confidence: 92%
“…Other work has leveraged the graph's structure instead of just the weights between records. In [25], the authors proposed three algorithms to cluster the similarity graph based on structure rather than edge weights. They argue that graph-based transitive closure, such as in [26], produces high recall but low precision because the graph's structure is not considered during clustering.…”
Section: B Record-record Similarity-based Graph Entity Resolution
Citation type: mentioning
confidence: 99%