2019 IEEE 35th International Conference on Data Engineering (ICDE) 2019
DOI: 10.1109/icde.2019.00027

A Semi-Supervised Framework of Clustering Selection for De-Duplication

Abstract: Data de-duplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view de-duplication as a clustering problem where the goal is to put records corresponding to the same physical entity in the same cluster and records corresponding to different physical entities into different clusters. We introduce a framework which we call promise correlation clustering. Given a complete graph G with the edges labelled 0 and 1, the goal is to fin…
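
To make the setup in the abstract concrete, the sketch below scores a clustering on a complete graph with 0/1 edge labels by counting disagreements, which is the standard correlation clustering objective. It is not the promise variant the paper defines, since the abstract is truncated before that definition. The convention that label 1 means "same entity", the function name `disagreements`, and the toy records are assumptions made for illustration.

```python
from itertools import combinations


def disagreements(labels, clusters):
    """Count edges whose 0/1 label conflicts with a clustering.

    labels:   dict mapping frozenset({u, v}) to 1 or 0.
              Assumption: 1 = "same entity", 0 = "different entities".
    clusters: dict mapping each record to a cluster id.
    """
    cost = 0
    for u, v in combinations(sorted(clusters), 2):
        same_cluster = clusters[u] == clusters[v]
        says_same = labels[frozenset((u, v))] == 1
        # A disagreement is a 1-edge split across clusters
        # or a 0-edge placed inside one cluster.
        if says_same != same_cluster:
            cost += 1
    return cost


# Hypothetical toy instance: four records, every pair labelled.
records = ["a", "b", "c", "d"]
labels = {frozenset(pair): 0 for pair in combinations(records, 2)}
labels[frozenset(("a", "b"))] = 1
labels[frozenset(("b", "c"))] = 1  # a-c stays 0, so the labels are inconsistent

merged = {"a": 0, "b": 0, "c": 0, "d": 1}      # a, b, c in one cluster
split = {"a": 0, "b": 0, "c": 1, "d": 2}       # c separated from a, b
singletons = {"a": 0, "b": 1, "c": 2, "d": 3}  # every record alone

for name, clustering in [("merged", merged), ("split", split), ("singletons", singletons)]:
    print(name, disagreements(labels, clustering))
```

Because the toy labels are inconsistent (a-b and b-c say "same" while a-c says "different"), no clustering reaches zero disagreements, which is why the problem is posed as an optimization rather than a simple grouping.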

Cited by 11 publications (9 citation statements)
References 21 publications
“…It is important to optimize the execution cost of FoRWaRD, particularly in the case of batch updates with missing information (nulls). It will also important to understand the performance of FoRWaRD (in comparison to alternatives) on machine-learning tasks other than column prediction: record linking [13,29], entity resolution [10,21], data imputation [22,41], data cleaning [1,36,41] and so on.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…Correlation Clustering [12] solves an optimization task, where the goal is to maximize the sum of the intra-cluster edges, while minimizing the sum of the inter-cluster ones. This is an NP-hard problem that is typically solved through approximations, such as Clustering Aggregation [73] and Restricted Correlation Clustering [109]. The latter is a semi-supervised approach that leverages a small labeled dataset, which is carefully selected via an efficient sampling procedure based on LSH.…”
Section: Clustering Methods (citation type: mentioning)
confidence: 99%
“…The latter can undeniably cause jumping. That brings us to the conclusion that Stadion should not be used, with this formulation, for applications with large K w.r.t N, for instance deduplication [54], and that further investigations are needed in this context.…”
Section: B1 Importance Study With Fanova (citation type: mentioning)
confidence: 98%