Duplicate records are a common problem in data sets, especially in high-volume databases. The accuracy of duplicate detection determines the efficiency of the duplicate removal process. However, duplicate detection has become more challenging due to the presence of missing values within records: during clustering and matching, missing values can cause similar records to be inserted into the wrong group, leading to undetected duplicates. In this paper, an improvement to duplicate detection in the presence of missing values is proposed through the Duplicate Detection within the Incomplete Data set (DDID) method. Missing values were hypothetically added to the key attributes of the three data sets under study, using an arbitrary pattern, to simulate both complete and incomplete data sets. The results were analyzed, and the performance of duplicate detection was then evaluated using the Hot Deck method to compensate for the missing values in the key attributes. It was hypothesized that using Hot Deck would improve duplicate detection performance. Furthermore, DDID's performance was compared to an earlier duplicate detection method, DuDe, in terms of accuracy and speed. The findings show that even though the data sets were incomplete, DDID offered better accuracy and faster duplicate detection than DuDe. The results of this study offer insights into the constraints of duplicate detection within incomplete data sets.
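To make the Hot Deck idea concrete, the following is a minimal sketch in Python, assuming a pandas DataFrame with hypothetical columns: a missing key attribute is filled from a donor record that shares a grouping attribute. The column names and the donor-selection rule (first available donor per group) are illustrative assumptions, not DDID's exact procedure.

```python
# Minimal Hot Deck sketch: fill a missing key attribute from a donor
# record in the same group. Columns "city" (grouping attribute) and
# "postcode" (key attribute) are hypothetical examples.
import pandas as pd

def hot_deck_impute(df, group_col, target_col):
    """Fill missing target_col values with an observed value drawn
    from a donor record that shares the same group_col value."""
    def fill(group):
        donors = group[target_col].dropna()
        if donors.empty:
            return group  # no donor available; leave values missing
        return group.assign(
            **{target_col: group[target_col].fillna(donors.iloc[0])})
    return df.groupby(group_col, group_keys=False).apply(fill)

records = pd.DataFrame({
    "city":     ["Cairo", "Cairo", "Kuala Lumpur"],
    "postcode": ["11511", None,    "50450"],
})
print(hot_deck_impute(records, "city", "postcode"))
```

After imputation, the previously incomplete record carries a plausible key value, so it can participate in sorting and matching instead of being silently excluded.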
In duplicate detection over database records, blocking is commonly used to reduce the number of comparisons between candidate record pairs. The main procedure in this method is selecting the attributes that will be used as sorting keys. Accurate selection is essential for clustering candidate records that are likely matches into the same block. Nevertheless, the presence of missing values affects the creation of sorting keys, which is particularly undesirable when the missing values occur in the attributes used as sorting keys: records that should be included in the duplicate detection procedure are then excluded from examination. Thus, in this paper, we propose a method that deals with the impact of missing values by using a dynamic sorting key. Dynamic sorting is an extension of the blocking method that rests on two functions, a uniqueness calculation function (UF) to choose distinctive attributes and a completeness function (CF) to screen for missing values. We tested a particular blocking method, sorted neighborhood, with a dynamic sorting key on a restaurant data set (consisting of duplicate records) obtained from earlier research, in order to evaluate the method's accuracy and speed. Hypothetical missing values were applied to the test data set, and we compared the results of duplicate detection with and without the dynamic sorting key. The results show that, even when missing values are present, there is a promising improvement in partitioning duplicate records into the same block.
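A hedged sketch of what a dynamic sorting key can look like is given below. Uniqueness is taken here as the ratio of distinct values per attribute and completeness as the ratio of non-missing values; both formulas and the rank-by-product rule are our assumptions and may differ from the paper's exact UF and CF definitions.

```python
# Sketch of a dynamic sorting key: rank attributes by an assumed
# uniqueness * completeness score, then build each record's key from
# the highest-ranked attributes that are actually present.
import pandas as pd

def rank_attributes(df):
    uf = {c: df[c].nunique(dropna=True) / len(df) for c in df.columns}
    cf = {c: df[c].notna().mean() for c in df.columns}
    return sorted(df.columns, key=lambda c: uf[c] * cf[c], reverse=True)

def dynamic_sort_key(row, ranked, prefix_len=3):
    # Skip any attribute whose value is missing in this record, so the
    # key degrades gracefully instead of collapsing to an empty string.
    parts = [str(row[c])[:prefix_len] for c in ranked if pd.notna(row[c])]
    return "".join(parts)

restaurants = pd.DataFrame({
    "name": ["arnie morton's", "arnie mortons", None],
    "city": ["los angeles", "los angeles", "new york"],
})
ranked = rank_attributes(restaurants)
restaurants["sort_key"] = restaurants.apply(
    dynamic_sort_key, axis=1, ranked=ranked)
print(restaurants.sort_values("sort_key"))
```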
The rapid growth of open data sources is driven by free-of-charge content and ease of accessibility. While it is convenient for public data consumers to use data sets extracted from open data sources, the decision to use these data sets should be based on their quality. Several data quality dimensions, such as completeness, accuracy, and timeliness, are common requirements for making data fit for use. More importantly, in many cases, high-quality data sets are desirable for ensuring reliable outcomes of reports and analytics. Even though many open data sources provide data quality guidelines, the responsibility of ensuring high-quality data requires commitment from data contributors. In this paper, an initial investigation of the quality of open data sets in terms of the completeness dimension was conducted. In particular, missing values were measured in 20 data sets extracted from open data sources. The analysis covered all representations of missing values, not merely nulls or blank spaces. The results exhibit a range of missing value ratios that indicate each data set's level of completeness. The limited coverage of this analysis does not hinder understanding of the current level of completeness of open data sets. The findings may motivate open data providers to design initiatives that strengthen data quality policies and guidelines for data contributors. In addition, this analysis may assist public data users in deciding on the acceptability of open data sets by applying the simple methods proposed in this paper, or in performing data cleaning actions to improve the completeness of the data sets concerned.
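As an illustration of the kind of simple completeness check described above, the sketch below computes a missing value ratio that counts common textual placeholders as well as nulls. The placeholder list is our assumption of typical representations, not the paper's definitive set.

```python
# Completeness sketch: count cells that are null OR match common
# missing-value placeholders, then report the missing ratio.
import pandas as pd

PLACEHOLDERS = {"", "na", "n/a", "null", "none", "-", "?"}

def missing_ratio(df):
    def is_missing(v):
        return pd.isna(v) or str(v).strip().lower() in PLACEHOLDERS
    missing = sum(is_missing(v) for v in df.to_numpy().ravel())
    return missing / df.size

sample = pd.DataFrame({"a": [1, None, "N/A"], "b": ["x", "", "-"]})
print(f"missing ratio: {missing_ratio(sample):.2%}")  # 4 of 6 cells
```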
Missing values (MV) are one form of data completeness problem in massive datasets. To deal with missing values, data imputation methods have been proposed with the aim of improving the completeness of the datasets concerned. Imputation accuracy is a common indicator of a data imputation technique's efficiency. However, the efficiency of data imputation can be affected by the language in which the dataset is written. To overcome this problem, it is necessary to normalize the data, especially for non-Latin scripts such as Arabic. This paper proposes a method that addresses the challenges inherent in Arabic datasets by extending the enhanced robust association rules (ERAR) method with Arabic detection and correction functions. Iterative and Decision Tree methods were used as baselines to evaluate the proposed method in an experiment. The experimental results show that the proposed method offers higher data imputation accuracy than the Iterative and Decision Tree methods.
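For illustration, the sketch below applies common Arabic normalization rules (diacritic and tatweel removal, unification of alef, yaa, and ta marbuta variants) of the kind such detection and correction functions might use; these rules are standard normalization practice, not necessarily the paper's exact functions.

```python
# Common Arabic text normalization steps, so that spelling variants of
# the same word compare equal before imputation or matching.
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # Arabic tashkeel marks
TATWEEL = "\u0640"

def normalize_arabic(text):
    text = DIACRITICS.sub("", text)      # strip short-vowel marks
    text = text.replace(TATWEEL, "")     # strip elongation character
    text = re.sub("[أإآ]", "ا", text)    # unify alef variants
    text = text.replace("ى", "ي")        # unify yaa variants
    text = text.replace("ة", "ه")        # unify ta marbuta
    return text

# Two surface forms of the same name normalize to the same string.
print(normalize_arabic("مُحَمَّد") == normalize_arabic("محمد"))  # True
```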
Clustering is a technique used to reduce comparisons between candidate records in the duplicate detection process. The clustering of records is affected by the quality of the data: the more error-free the data, the more efficient the clustering algorithm, since data errors cause records to be placed in incorrect groups. Window algorithms are sensitive to the window size; the larger the window, the greater the number of unnecessary comparisons, while too small a window may prevent the detection of duplicates that should fall within it. In this paper, we propose a data pre-processing method that increases the efficiency of window algorithms in grouping similar records together. The proposed method also deals with the window size problem. In the proposed method, high-rank attributes are selected and preparators are then applied to the selected attributes. A compensation algorithm is implemented to reduce the problem of missing and distorted sort keys. Two datasets, the compact disc database (CDDB) and MusicBrainz, were used to test the duplicate detection algorithms, with the duplicate detection toolkit (DuDe) used as a benchmark for the proposed method. Experiments showed that the proposed method achieved a high rate of accuracy in detecting duplicates.
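The sketch below illustrates, under assumed attribute ranks and a simple first-non-missing fallback rule, how a compensated sort key keeps records with missing key attributes sortable within a sorted-neighborhood window; it is not the paper's compensation algorithm.

```python
# Sorted-neighborhood pass with a simple sort-key "compensation" step:
# when a high-rank attribute is missing, fall back to the next available
# attribute so the record still sorts near its likely duplicates.
def build_key(record, ranked_attrs, prefix=4):
    for attr in ranked_attrs:            # compensation: first non-missing
        value = record.get(attr)
        if value:
            return str(value).lower()[:prefix]
    return ""                            # every key attribute missing

def sorted_neighborhood_pairs(records, ranked_attrs, window=3):
    ordered = sorted(records, key=lambda r: build_key(r, ranked_attrs))
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            yield rec, other             # candidate pair for matching

cds = [
    {"artist": "Nirvana", "title": "Nevermind"},
    {"artist": None,      "title": "Nevermind"},  # missing sort attribute
    {"artist": "Nirvana", "title": "In Utero"},
]
for a, b in sorted_neighborhood_pairs(cds, ["artist", "title"]):
    print(a["title"], "<->", b["title"])
```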