Text classification is an important subject in data mining. Several methods have been developed for text classification, such as nearest neighbor analysis and latent semantic analysis. The k-nearest neighbor (kNN) classifier is a well-known, simple, and effective method for classifying data in many domains. In using kNN, the distance function is important for measuring the distance and similarity between data. To improve the performance of the kNN classifier, a new approach that combines multiple distance functions is proposed here. The weighting factors of elements in the distance function are computed by a Genetic Algorithm (GA). Further, ensemble processing was developed to improve classification accuracy. Finally, experiments show that the methods developed here are effective in text classification. In this paper, we use the tolerant rough set and a GA to create a suitable model and feature weights (weights of the distance function) for kNN. To measure the distance between data in kNN, the distance function is important. The most commonly used function is the Euclidean distance function (Euclid), which compares two input vectors (one typically from a stored instance, and the other an input vector to be classified). One weakness of the Euclidean distance function is that if one of the input attributes has a relatively large range, it can overpower the other attributes. Therefore, distances are often normalized by dividing the distance for each attribute by the range (i.e., maximum minus minimum) of that attribute. An attribute can be linear or nominal, and a linear attribute can be continuous or discrete. A continuous attribute uses real values, such as the mass of an object or the velocity of a car. A linear discrete attribute can take a discrete set of linear values, such as the number of children.
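The range normalization described above can be sketched as follows; this is a minimal illustration (function and variable names are our own, not from the original implementation), showing how dividing each attribute's difference by its range keeps a wide-range attribute from dominating the distance.

```python
import numpy as np

def normalized_euclidean(x, y, attr_min, attr_max):
    """Range-normalized Euclidean distance: each attribute's
    difference is divided by that attribute's range (max - min),
    so no single wide-range attribute overpowers the others."""
    rng = np.where(attr_max > attr_min, attr_max - attr_min, 1.0)
    return np.sqrt(np.sum(((x - y) / rng) ** 2))

# Toy data: attribute 0 ranges over thousands, attribute 1 over units.
X = np.array([[1000.0, 1.0], [3000.0, 2.0], [5000.0, 3.0]])
mins, maxs = X.min(axis=0), X.max(axis=0)
d = normalized_euclidean(X[0], X[1], mins, maxs)
```

After normalization both attributes contribute 0.5 to the squared distance in this toy example, even though their raw differences differ by three orders of magnitude.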
One way to handle applications with both continuous and nominal attributes is to use a heterogeneous distance function that applies different attribute distance functions to different kinds of attributes. The Heterogeneous Euclidean-Overlap Metric (HEOM) uses the overlap metric for nominal attributes and normalized Euclidean distance for linear attributes; it removes the effects of the arbitrary ordering of nominal values. The Value Difference Metric (VDM) is an appropriate distance function for nominal attributes. A simplified version of the VDM defines the distance between two values of an attribute in terms of the conditional probabilities of the output classes given each value. A weakness of the VDM is that it is inappropriate to apply directly to continuous attributes; one approach to using the VDM on continuous attributes is discretization. To overcome this weakness, Wilson and Martinez developed the Heterogeneous Value Difference Metric (HVDM), which uses Euclidean distance for linear attributes and the VDM for nominal attributes. A further distance function is the Interpolated Value Difference Metric (IVDM). In this study, the weighted ...
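The HEOM described above can be sketched as follows; this is an illustrative sketch (the function signature and the toy attributes are our own), combining the overlap metric for nominal attributes with a range-normalized difference for linear ones.

```python
import math

def heom(x, y, is_nominal, ranges):
    """Heterogeneous Euclidean-Overlap Metric (HEOM) sketch:
    overlap metric (0 if equal, 1 otherwise) for nominal
    attributes, range-normalized difference for linear ones."""
    total = 0.0
    for xa, ya, nominal, rng in zip(x, y, is_nominal, ranges):
        if nominal:
            d = 0.0 if xa == ya else 1.0            # overlap metric
        else:
            d = abs(xa - ya) / rng if rng else 0.0  # normalized linear
        total += d * d
    return math.sqrt(total)

# One nominal attribute (color) and one linear attribute (size, range 10).
x = ("red", 4.0)
y = ("blue", 6.0)
d = heom(x, y, is_nominal=(True, False), ranges=(None, 10.0))
```

Here the nominal mismatch contributes a full unit of squared distance, while the linear attribute contributes only its normalized difference squared, so both attribute kinds are handled within one metric.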