On the suitability of Prototype Selection methods for kNN classification with distributed data

Valero-Mas, Jose J.; Calvo-Zaragoza, Jorge; Rico-Juan, Juan Ramón

doi:10.1016/j.neucom.2016.04.018

Cited by 14 publications

(7 citation statements)

References 34 publications


“…Finally, we decided to use an additional metric that relates the performance and degree of reduction: the so-called estimated profit per prototype (Valero-Mas et al 2016). This measure is defined as the ratio between the classification rate and the number of distances computed or, in this context, the number of elements in the training set.…”

Section: Metrics (mentioning)

confidence: 99%
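
The ratio described in the statement above can be sketched in a few lines. This is a minimal illustration, assuming the score is expressed as a fraction in [0, 1] and that the denominator is the size of the (reduced) training set; the function name and the figures are hypothetical and not taken from the cited work.

```python
def estimated_profit_per_prototype(score: float, n_prototypes: int) -> float:
    """Ratio between a performance score and the number of prototypes kept.

    `score` may be the classification rate or any other figure of merit;
    `n_prototypes` stands in for the number of distances computed per query.
    """
    if n_prototypes <= 0:
        raise ValueError("n_prototypes must be positive")
    return score / n_prototypes


# Hypothetical figures, for illustration only.
full = estimated_profit_per_prototype(0.90, 1000)    # whole training set
reduced = estimated_profit_per_prototype(0.85, 100)  # after reduction
```

With these made-up numbers the reduced set loses some accuracy but yields a higher profit per prototype, which is the trade-off the measure is meant to expose.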

Prototype generation in the string space via approximate median for data reduction in nearest neighbor classification

2021

Self Cite


The k-nearest neighbor (kNN) rule is one of the best-known distance-based classifiers, and is usually associated with high performance and versatility as it requires only the definition of a dissimilarity measure. Nevertheless, kNN is also coupled with low-efficiency levels since, for each new query, the algorithm must carry out an exhaustive search of the training data, and this drawback is much more relevant when considering complex structural representations, such as graphs, trees or strings, owing to the cost of the dissimilarity metrics. This issue has generally been tackled through the use of data reduction (DR) techniques, which reduce the size of the reference set, but the complexity of structural data has historically limited their application in the aforementioned scenarios. A DR algorithm denominated as reduction through homogeneous clusters (RHC) has recently been adapted to string representations but as obtaining the exact median value of a set of string data is known to be computationally difficult, its authors resorted to computing the set-median value. Under the premise that a more exact median value may be beneficial in this context, we, therefore, present a new adaptation of the RHC algorithm for string data, in which an approximate median computation is carried out. The results obtained show significant improvements when compared to those of the set-median version of the algorithm, in terms of both classification performance and reduction rates.
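
For readers unfamiliar with the set-median mentioned in the abstract above, the following sketch shows one way to compute it: the set-median is the member of the set whose summed dissimilarity to all members is smallest. The choice of unit-cost edit distance as the dissimilarity is an assumption made here for illustration, and the function names are hypothetical; this is not the authors' implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def set_median(strings):
    """Return the member of `strings` with the smallest summed distance
    to every member of the set (ties go to the first such member)."""
    if not strings:
        raise ValueError("set_median needs at least one string")
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))


median = set_median(["car", "cart", "card", "dog"])
```

Because the result is restricted to strings already in the set, the search is quadratic in the set size but avoids the harder problem of finding an exact median over all possible strings.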


“…Some research has focused on this issue to address the problem. For example, Valero-Mas et al [16] have proposed the prototype selection strategies that could be used to develop KNN classification for distributed data.…”

Section: Related Work (mentioning)

confidence: 99%

A New K-Nearest Neighbors Classifier for Big Data Based on Efficient Data Pruning

et al. 2020


The K-nearest neighbors (KNN) machine learning algorithm is a well-known non-parametric classification method. However, like other traditional data mining methods, applying it to big data comes with computational challenges. Indeed, KNN determines the class of a new sample based on the classes of its nearest neighbors; however, identifying the neighbors in a large amount of data imposes such a large computational cost that the method is no longer feasible on a single computing machine. One of the proposed techniques to make classification methods applicable to large datasets is pruning. LC-KNN is an improved KNN method which first clusters the data into smaller partitions using the K-means clustering method, and then applies KNN for each new sample on the partition whose center is nearest. However, because the clusters have different shapes and densities, selecting the appropriate cluster is a challenge. In this paper, an approach is proposed to improve the pruning phase of the LC-KNN method by taking these factors into account. The proposed approach helps to choose a more appropriate cluster of data in which to look for the neighbors, thus increasing the classification accuracy. The performance of the proposed approach is evaluated on different real datasets. The experimental results show the effectiveness of the proposed approach and its higher classification accuracy and lower time cost in comparison to other recent relevant methods.
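
The baseline idea summarized in this abstract (cluster with K-means, then run KNN only inside the partition with the nearest center) can be sketched as follows. This is a rough illustration under stated assumptions: Euclidean distance, plain Lloyd iterations with caller-supplied initial centers, and hypothetical function names. It does not reproduce the improved cluster-selection step that the paper proposes.

```python
from collections import Counter

import numpy as np


def kmeans(X, init_centers, n_iter=10):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(len(centers)):
            members = X[labels == c]
            if len(members):  # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    # Final assignment against the final centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)


def cluster_then_knn(X, y, query, centers, labels, k=3):
    """Classify `query` with kNN restricted to the nearest non-empty cluster."""
    nonempty = np.unique(labels)
    nearest = nonempty[np.linalg.norm(centers[nonempty] - query, axis=1).argmin()]
    mask = labels == nearest
    Xc, yc = X[mask], y[mask]
    # The k closest points inside that single partition vote on the class.
    order = np.linalg.norm(Xc - query, axis=1).argsort()[:k]
    return Counter(yc[order].tolist()).most_common(1)[0][0]


# Toy data: two well-separated groups (illustrative values only).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
centers, labels = kmeans(X, init_centers=X[[0, -1]])
pred_a = cluster_then_knn(X, y, np.array([0.2, 0.3]), centers, labels)
pred_b = cluster_then_knn(X, y, np.array([10.4, 10.9]), centers, labels)
```

Each query is compared against the cluster centers and then only against the members of one partition, which is where the saving over an exhaustive search comes from; the price is a possible loss of accuracy when the true neighbors sit in a different partition.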


“…Finally, in order to provide a single value which relates both the performance and reduction capabilities of the strategies considered, we also consider the estimated profit per prototype measure defined as the ratio between the classification accuracy and the total number of distances computed [30]. It must be mentioned that, for its use in this work, this metric was slightly adapted from its original definition by considering the F1 instead of the classification accuracy as well as the resulting set size instead of the number of distances computed.…”

Section: Performance Measurement (mentioning)

confidence: 99%

Data Reduction in the String Space for Efficient kNN Classification Through Space Partitioning

Valero-Mas, Castellanos

2020

Applied Sciences

Self Cite


Within the Pattern Recognition field, two representations are generally considered for encoding the data: statistical codifications, which describe elements as feature vectors, and structural representations, which encode elements as high-level symbolic data structures such as strings, trees or graphs. While the vast majority of classifiers are capable of addressing statistical spaces, only some particular methods are suitable for structural representations. The kNN classifier constitutes one of the few examples of algorithms capable of tackling both statistical and structural spaces. This method is based on the computation of the dissimilarity between all the samples of the set, which is the main reason for its high versatility, but also for its low efficiency. Prototype Generation is one of the possibilities for palliating this issue. These mechanisms generate a reduced version of the initial dataset by performing data transformation and aggregation processes on the initial collection. Nevertheless, these generation processes are quite dependent on the data representation considered, and are generally not well defined for structural data. In this work we present the adaptation of the generation-based reduction algorithm Reduction through Homogeneous Clusters to the case of string data. This algorithm performs the reduction by partitioning the space into class-homogeneous clusters and then generating a representative prototype as the median value of each group. Thus, the main issue to tackle is the retrieval of the median element of a set of strings. Our comprehensive experimentation comparatively assesses the performance of this algorithm in both the statistical and the string-based spaces. Results prove the relevance of our approach by showing a competitive compromise between classification rate and data reduction.
