In duplicate detection over database records, blocking methods are commonly used to reduce the number of comparisons between candidate record pairs. The main step in such methods is selecting the attributes that will serve as sorting keys. Accurate selection is essential for clustering candidate records that are likely to match into the same block. However, missing values interfere with the creation of sorting keys, which is particularly undesirable when they occur in the attributes used as sorting keys: records that should be examined during duplicate detection are then excluded from comparison. In this paper, we therefore propose a method that mitigates the impact of missing values by using a dynamic sorting key. Dynamic sorting is an extension of the blocking method that relies on two functions: a uniqueness calculation function (UF), which chooses unique attributes, and a completeness function (CF), which searches for missing values. We applied a particular blocking method, sorted neighborhood, with a dynamic sorting key to a restaurant data set containing duplicate records, obtained from earlier research, in order to evaluate the method's accuracy and speed. Hypothetical missing values were injected into the test data set, and we compared the results of duplicate detection with and without the dynamic sorting key. The results show that, even when missing values are present, there is a promising improvement in the partitioning of duplicate records into the same block.
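The idea of a dynamic sorting key can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define UF or CF precisely, so the formulas below (distinct non-missing values divided by record count for uniqueness, fraction of non-missing values for completeness), the `min_completeness` threshold, the function names, and the sample records are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Record = Dict[str, Optional[str]]


def completeness(records: List[Record], attr: str) -> float:
    """Assumed CF: fraction of records with a non-missing value for attr."""
    return sum(1 for r in records if r.get(attr)) / len(records)


def uniqueness(records: List[Record], attr: str) -> float:
    """Assumed UF: distinct non-missing values divided by record count."""
    return len({r[attr] for r in records if r.get(attr)}) / len(records)


def choose_sort_key(records: List[Record], attrs: List[str],
                    min_completeness: float = 1.0) -> str:
    """Pick the most unique attribute among sufficiently complete ones."""
    complete = [a for a in attrs if completeness(records, a) >= min_completeness]
    candidates = complete or attrs  # fall back if nothing is complete enough
    return max(candidates, key=lambda a: uniqueness(records, a))


def sorted_neighborhood(records: List[Record], key_attr: str,
                        window: int = 2) -> Set[Tuple[int, int]]:
    """Sort by key_attr and pair each record with the next window-1 records."""
    order = sorted(range(len(records)),
                   key=lambda i: (records[i].get(key_attr) or "").lower())
    pairs: Set[Tuple[int, int]] = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical records; 0 and 1 are duplicates, and record 1 lacks a name.
records: List[Record] = [
    {"name": "spago", "phone": "310-652-4025", "city": "los angeles"},
    {"name": None, "phone": "310-652-4025", "city": "los angeles"},
    {"name": "arnie mortons", "phone": "310-246-1501", "city": "los angeles"},
    {"name": "cafe bizou", "phone": "818-788-3536", "city": "sherman oaks"},
    {"name": "campanile", "phone": "213-938-1447", "city": "los angeles"},
]

static_pairs = sorted_neighborhood(records, "name")
dynamic_attr = choose_sort_key(records, ["name", "phone", "city"])
dynamic_pairs = sorted_neighborhood(records, dynamic_attr)
```

In this toy data, sorting on the fixed key `name` pushes the record with the missing name to the front of the list, away from its duplicate, whereas the dynamically chosen key (`phone` here) keeps the two records adjacent so that they fall inside the same window.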
Missing values, or incomplete data, are a common problem in many applications. In most cases, recovering missing values from data sets is necessary to avoid the biased conclusions that result from omitting them. Missing value recovery (also known as missing value imputation) is an important research subject in statistics and data mining. In this paper, we present the Enhanced Robust Association Rules (ERAR) method, which extracts useful association rules and avoids redundant ones. We show the enhancements made in ERAR to improve on the imputation performed by the original Robust Association Rules (RAR) method. ERAR is designed to select only those frequent items in a data set that are related to missing values, so unnecessary frequent items can be ignored when generating the association rules. The experimental results show that ERAR offers better performance in terms of the time taken for the imputation process and the amount of memory used to complete it. In particular, ERAR performs better on a monotone pattern of missing values than on an arbitrary pattern. In terms of imputation accuracy, we found that both ERAR and RAR exhibit decreasing accuracy as the amount of missing values increases for data with an arbitrary pattern, but this is not the case for data with a monotone pattern. With these findings, ERAR contributes to improving how incomplete data can be handled.
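The distinction between monotone and arbitrary missing patterns can be made concrete with a short check. The sketch below assumes the usual definition from the missing-data literature (the columns can be ordered so that once a value is missing in a row, every later column in that row is missing too); the abstract itself does not define the terms, and the function name `is_monotone` and the use of `None` to mark missing values are illustrative choices, not part of ERAR.

```python
from typing import Optional, Sequence


def is_monotone(rows: Sequence[Sequence[Optional[object]]]) -> bool:
    """Return True if the missing pattern (None = missing) is monotone."""
    if not rows:
        return True
    n_cols = len(rows[0])
    # Order columns from fewest to most missing values. If any monotone
    # ordering exists, the missing sets are nested and this ordering works.
    missing_counts = [sum(1 for r in rows if r[c] is None) for c in range(n_cols)]
    order = sorted(range(n_cols), key=lambda c: missing_counts[c])
    for row in rows:
        seen_missing = False
        for c in order:
            if row[c] is None:
                seen_missing = True
            elif seen_missing:
                # An observed value after a missing one breaks monotonicity.
                return False
    return True


monotone_data = [[1, 2, 3], [1, 2, None], [1, None, None]]
arbitrary_data = [[1, None, 3], [None, 2, 3], [1, 2, None]]
```

Here `monotone_data` has a staircase-shaped pattern of missing values, while in `arbitrary_data` the gaps are scattered and no column ordering nests them.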