Identifying and handling replicas is important to guarantee the quality of the information made available by modern data storage services. Companies and governments have invested heavily in the development of effective methods for removing replicas from large databases. Typically, this investment has produced significant results, since clean, replica-free databases not only allow the retrieval of higher-quality information but also lead to a more concise data representation and to potential savings in the computational time and resources needed to process and maintain the data. In this paper, we propose a genetic programming (GP)-based approach to automatic replica identification that combines evidence based on the data content in order to find a similarity function that is able to identify whether two entries in a repository are replicas or not. As shown by our experiments, our approach outperforms an SVM-based method used as a baseline by at least 6.5%. Moreover, the suggested functions are computationally less demanding, since they use less evidence. In addition, our approach is capable of automatically adapting to any given replica identification boundary.