2018
DOI: 10.14569/ijacsa.2018.090979
Duplicates Detection Within Incomplete Data Sets Using Blocking and Dynamic Sorting Key Methods

Abstract: In database record duplicate detection, the blocking method is commonly used to reduce the number of comparisons between candidate record pairs. The main procedure in this method requires selecting attributes to be used as sorting keys. Selection accuracy is essential for clustering candidate records that are likely matches into the same block. Nevertheless, the presence of missing values affects the creation of sorting keys, and this is particularly undesirable if it involves the attributes that are used…
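The blocking approach described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`blocking_key`, `block_records`), the 3-character prefix truncation, and the rule of dropping missing attributes from the key (one plausible reading of a "dynamic" sorting key) are all assumptions for the sake of the example.

```python
from collections import defaultdict

def blocking_key(record, key_attrs):
    # Hypothetical "dynamic" key: attributes whose value is missing are
    # simply dropped from the key, so incomplete records are still keyed
    # by the values they do have instead of an empty placeholder.
    parts = [str(record[a])[:3].lower() for a in key_attrs
             if record.get(a) not in (None, "")]
    return "".join(parts)

def block_records(records, key_attrs):
    # Group records by key; only pairs inside a block are ever compared,
    # which is how blocking cuts down the number of candidate comparisons.
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r, key_attrs)].append(r)
    return blocks

records = [
    {"id": 1, "name": "Smith", "city": "Boston"},
    {"id": 2, "name": "Smith", "city": None},   # missing value
    {"id": 3, "name": "Jones", "city": "Boston"},
]
blocks = block_records(records, ["name", "city"])
```

Note how record 2, which is missing `city`, lands in its own block under this naive rule; this is exactly the failure mode the paper targets, since a record with a missing key attribute can be separated from its true duplicate.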

Cited by 3 publications (5 citation statements)
References 12 publications
“…To improve the efficiency of the collection methods, the data preparation stage is followed by compensating for the missing values using one of the clustering methods to deal with the incompleteness problem [24]. The impact of missing values on data clustering was discussed in [25].…”
Section: Canopy Clustering Blocking
confidence: 99%
“…This stage plays a major role in generating strong sort keys that increase the probability of detecting true duplicates, as well as in avoiding attributes that give a false sense of duplication (such as gender or zip code). It also reduces the cost of detecting duplicates by restricting the matching process to the selected attribute values only, instead of comparing all the attribute values of the data set (for more see [25]).…”
Section: Attributes Selection
confidence: 99%
“…Detecting duplicates within incomplete data sets poses a unique challenge [11][12][13][14][15]. This is because missing values within records make these records look unique even though they refer to the same real-world entity.…”
Section: The Challenge of Detecting Duplicates
confidence: 99%
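The point that missing values make records of the same entity "look unique" can be shown with a small comparison sketch. Both matcher functions and the sample records are hypothetical, written only to illustrate the failure mode and one common workaround (ignoring attributes that are missing in either record).

```python
def exact_match(r1, r2, attrs):
    # Strict equality on every attribute: a missing value breaks the match.
    return all(r1.get(a) == r2.get(a) for a in attrs)

def tolerant_match(r1, r2, attrs):
    # Compare only attributes present in both records (an assumed, common
    # workaround; not necessarily the paper's method).
    shared = [a for a in attrs if r1.get(a) is not None and r2.get(a) is not None]
    return bool(shared) and all(r1[a] == r2[a] for a in shared)

# Two records for the same real-world person; one has a missing phone.
a = {"name": "Ann Lee", "phone": "555-0101", "email": "ann@example.com"}
b = {"name": "Ann Lee", "phone": None,       "email": "ann@example.com"}

attrs = ["name", "phone", "email"]
```

Under strict comparison the missing phone makes `b` appear to be a distinct entity; the tolerant rule recovers the match from the attributes both records share.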
“…These two data sets underwent a series of changes as reported in Draisbach and Naumann (see [40]), in addition to the changes mentioned in [15]. The unique identifiers that were added by the authors had been removed so that the creation of the dynamic sorting keys was not affected by them, as these identifiers did not belong to the actual data sets.…”
Section: Data Set
confidence: 99%
“…Only 3% of the records are duplicates. These two data sets underwent a series of changes as reported in Draisbach and Naumann (2010) (see [40]), in addition to the changes mentioned in [15]. The unique identifiers that were added by the authors have been removed so that the creation of the dynamic sorting keys is not affected by them, as these identifiers do not belong to the actual data sets.…”
Section: Data Set
confidence: 99%