2018
DOI: 10.14569/ijacsa.2018.090979
Duplicates Detection Within Incomplete Data Sets Using Blocking and Dynamic Sorting Key Methods

Abstract: In database record duplicate detection, the blocking method is commonly used to reduce the number of comparisons between candidate record pairs. The main procedure in this method requires selecting attributes to be used as sorting keys. Selection accuracy is essential for clustering candidate records that are likely matches into the same block. Nevertheless, the presence of missing values affects the creation of sorting keys, and this is particularly undesirable if it involves the attributes that are used…
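The blocking approach described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`blocking_key`, `block_records`), the 3-character prefix truncation, and the rule of dropping missing attributes from the key (one plausible reading of a "dynamic" sorting key) are all assumptions for the sake of the example.

```python
from collections import defaultdict

def blocking_key(record, key_attrs):
    # Hypothetical "dynamic" key: attributes whose value is missing are
    # simply dropped from the key, so incomplete records are still keyed
    # by the values they do have instead of an empty placeholder.
    parts = [str(record[a])[:3].lower() for a in key_attrs
             if record.get(a) not in (None, "")]
    return "".join(parts)

def block_records(records, key_attrs):
    # Group records by key; only pairs inside a block are ever compared,
    # which is how blocking cuts down the number of candidate comparisons.
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r, key_attrs)].append(r)
    return blocks

records = [
    {"id": 1, "name": "Smith", "city": "Boston"},
    {"id": 2, "name": "Smith", "city": None},   # missing value
    {"id": 3, "name": "Jones", "city": "Boston"},
]
blocks = block_records(records, ["name", "city"])
```

Note how record 2, which is missing `city`, lands in its own block under this naive rule; this is exactly the failure mode the paper targets, since a record with a missing key attribute can be separated from its true duplicate.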

Cited by 3 publications (5 citation statements)
References 12 publications
“…To improve the efficiency of the collection methods, the data preparation stage is followed by compensating for the missing values using one of the clustering methods to deal with the incompleteness problem [24]. The impact of missing values on data clustering was discussed in [25].…”
Section: Canopy Clustering Blocking
confidence: 99%
“…This stage plays a major role in generating strong sort keys that increase the probability of detecting true duplicates, as well as in avoiding attributes that give a false sense of duplication (such as gender or zip code). It also reduces the cost of detecting duplicates by restricting the matching process to the selected attribute values only, instead of comparing all the attribute values of the data set (for more see [25]).…”
Section: Attributes Selection
confidence: 99%
“…Detecting duplicates within incomplete data sets poses a unique challenge [11][12][13][14][15]. This is because missing values within records make these records look unique even though they refer to the same real-world entity.…”
Section: The Challenge of Detecting Duplicates
confidence: 99%
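The point that missing values make records of the same entity "look unique" can be shown with a small comparison sketch. Both matcher functions and the sample records are hypothetical, written only to illustrate the failure mode and one common workaround (ignoring attributes that are missing in either record).

```python
def exact_match(r1, r2, attrs):
    # Strict equality on every attribute: a missing value breaks the match.
    return all(r1.get(a) == r2.get(a) for a in attrs)

def tolerant_match(r1, r2, attrs):
    # Compare only attributes present in both records (an assumed, common
    # workaround; not necessarily the paper's method).
    shared = [a for a in attrs if r1.get(a) is not None and r2.get(a) is not None]
    return bool(shared) and all(r1[a] == r2[a] for a in shared)

# Two records for the same real-world person; one has a missing phone.
a = {"name": "Ann Lee", "phone": "555-0101", "email": "ann@example.com"}
b = {"name": "Ann Lee", "phone": None,       "email": "ann@example.com"}

attrs = ["name", "phone", "email"]
```

Under strict comparison the missing phone makes `b` appear to be a distinct entity; the tolerant rule recovers the match from the attributes both records share.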
“…These two data sets underwent a series of changes as reported in Draisbach and Naumann (see [40]), in addition to the changes mentioned in [15]. The unique identifiers that were added by the authors had been removed so that the creation of the dynamic sorting keys was not affected by them, as these identifiers did not belong to the actual data sets.…”
Section: Data Set
confidence: 99%
“…Only 3% of the records are duplicates. These two data sets underwent a series of changes as reported in Draisbach and Naumann (2010) (see [40]), in addition to the changes mentioned in [15]. The unique identifiers that were added by the authors have been removed so that the creation of the dynamic sorting keys is not affected by them, as these identifiers do not belong to the actual data sets.…”
Section: Data Set
confidence: 99%