Abstract. This paper presents a fast k-nearest-neighbor (k-NN) classifier for documents. Documents are usually represented in a high-dimensional feature space, where the terms that appear in a document are treated as features and the weight of each term reflects its importance in that document. Many approaches exist for finding the neighborhood of an object, but their performance degrades drastically as the number of dimensions grows, which prevents their application to documents. The proposed method is based on a graph index structure with a fast search algorithm. Its high selectivity yields classification quality similar to that of an exhaustive classifier while computing only a small number of distances. Our experimental results show that the proposed method is feasible for problems of very high dimensionality, such as text mining.
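For orientation, the exhaustive classifier that the abstract uses as its quality reference can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes documents are sparse term-weight dictionaries compared by cosine similarity, and the names `cosine`, `knn_classify`, and the toy training set are hypothetical. It computes one distance per training document, which is exactly the cost a graph index aims to avoid.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, training, k=3):
    """Exhaustive k-NN: score the query against every training document,
    then return the majority label among the k most similar ones."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy collection: (term-weight vector, class label).
training = [
    ({"graph": 1.0, "index": 0.8}, "retrieval"),
    ({"graph": 0.9, "search": 0.7}, "retrieval"),
    ({"protein": 1.0, "gene": 0.6}, "biology"),
    ({"gene": 0.9, "cell": 0.5}, "biology"),
]

print(knn_classify({"graph": 1.0, "search": 0.5}, training, k=3))
```

With n training documents, every query costs n similarity computations; an index with high selectivity tries to reach the same neighbors while evaluating only a small fraction of them.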
Access methods are a fundamental tool in Information Retrieval. However, most of these methods suffer from the curse of dimensionality when applied to objects with very high-dimensional representation spaces, such as text documents. In this paper we introduce a new parallel access method that uses several graphs as a distributed index structure, together with a k-NN search algorithm. Two parallel versions of the search method are presented: one based on a master-slave scheme and the other on a pipeline. A thorough experimental analysis on different datasets shows that our method can efficiently process large flows of queries, compete with other parallel algorithms, and at the same time obtain very high-quality results.
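The master-slave scheme mentioned above can be illustrated roughly as follows. This sketch makes several assumptions that the abstract does not confirm: the collection is split into disjoint partitions, each worker scans its partition exhaustively (a stand-in for the per-partition graph search), similarity is cosine over sparse term-weight dictionaries, and workers are threads. The names `local_knn` and `parallel_knn` are hypothetical.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def local_knn(partition, query, k):
    """Worker: k best (similarity, doc_id) pairs within one partition.
    Assumption: a plain scan replaces the graph-based search here."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in partition)
    return heapq.nlargest(k, scored)


def parallel_knn(partitions, query, k):
    """Master: send the query to every worker, then merge partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial = list(pool.map(lambda p: local_knn(p, query, k), partitions))
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [doc_id for _, doc_id in merged]


# Hypothetical partitions of (doc_id, term-weight vector) pairs.
partitions = [
    [("d1", {"graph": 1.0, "index": 0.8}), ("d2", {"protein": 1.0})],
    [("d3", {"graph": 0.9, "search": 0.7}), ("d4", {"gene": 0.9})],
]

print(parallel_knn(partitions, {"graph": 1.0, "search": 0.5}, k=2))
```

Because each worker returns its own top k, the merged top k is the same as that of a single scan over the whole collection, provided each local search is exact. A pipeline variant would instead pass the query and its running top-k list from one partition to the next.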