2015
DOI: 10.1109/tkde.2015.2416734

A Practical and Effective Sampling Selection Strategy for Large Scale Deduplication

Abstract: The data deduplication task has attracted a considerable amount of attention from the research community in order to provide effective and efficient solutions. The information provided by the user to tune the deduplication process is usually represented by a set of manually labeled pairs. In very large datasets, producing this kind of labeled set is a daunting task since it requires an expert to select and label a large number of informative pairs. In this article, we propose a two-stage sampling selection str…

Cited by 24 publications (11 citation statements)
References 24 publications
“…A traditional method of producing data blocks is defining one of the attributes as the blocking criterion or key [4]. Only records that have the same blocking key will be inserted into the same block, reducing substantially the number of comparisons.…”
Section: Blocking Methods (mentioning)
confidence: 99%
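The key-based blocking described in the statement above can be sketched as follows. This is a minimal illustration rather than the cited paper's implementation: the sample records, the choice of `zip` as the blocking key, and the helper names `block_by_key` and `candidate_pairs` are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(records, key_fn):
    """Group records into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key_fn(record)].append(record)
    return blocks


def candidate_pairs(blocks):
    """Yield record pairs drawn only from within the same block."""
    for block in blocks.values():
        yield from combinations(block, 2)


# Hypothetical toy records; "zip" is an assumed blocking attribute.
records = [
    {"id": 1, "name": "Ann Smith", "zip": "10001"},
    {"id": 2, "name": "Anne Smith", "zip": "10001"},
    {"id": 3, "name": "Bob Jones", "zip": "94105"},
    {"id": 4, "name": "Robert Jones", "zip": "94105"},
    {"id": 5, "name": "Carla Diaz", "zip": "60601"},
]

blocks = block_by_key(records, key_fn=lambda r: r["zip"])
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]

# Number of comparisons without blocking: n * (n - 1) / 2.
all_pairs = len(records) * (len(records) - 1) // 2

print(pairs)                        # [(1, 2), (3, 4)]
print(len(pairs), "of", all_pairs)  # 2 of 10
```

Only records that share a key are ever compared, so a true duplicate whose key differs (for example, because of a mistyped postal code) would be missed. That trade-off is inherent to choosing a single attribute as the blocking key.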
“…However, such a method results in a quadratic cost of processing which would be non-practical in the case of a large dataset. In this context, blocking appears as an alternative to reduce the search space by only processing the records that have some indication of representing a duplicate [4].…”
Section: Related Work (mentioning)
confidence: 99%
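As a rough worked illustration of the quadratic cost this statement refers to, consider comparing every pair among n records, and then comparing only within b equal-sized blocks. The values n = 10^6 and b = 1000 are assumptions chosen for the arithmetic, not figures from the cited work, and real blocks are rarely equal in size.

```latex
% All-pairs comparison of n records (assumed n = 10^6)
\frac{n(n-1)}{2} = \frac{10^{6}\,(10^{6}-1)}{2} = 499{,}999{,}500{,}000 \approx 5 \times 10^{11}

% Comparison within b equal-sized blocks of n/b records each (assumed b = 1000)
b \cdot \frac{\frac{n}{b}\left(\frac{n}{b}-1\right)}{2}
  = 1000 \cdot \frac{1000 \cdot 999}{2}
  = 499{,}500{,}000 \approx 5 \times 10^{8}
```

Under these assumptions, blocking reduces the number of comparisons by roughly a factor of b.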
“…Instead, we may consider direct matching of place names, which can be done using deep neural networks [3,15,19,30,40,45,49]. However, these models are known to be more suitable for longer sentences with more complex structures, rather than the short simple place names, and they require large amounts of training data, which is extremely hard to acquire for place deduplication [7,9,31]. Finally, by considering additional place attributes like location, ad-hoc models have been developed based on heuristic feature engineering [8,10,22], but their flexibility is limited to take more different place attributes, such as address and category.…”
Section: NLP: Entity Resolution (mentioning)
confidence: 99%