Prototype generation in the string space via approximate median for data reduction in nearest neighbor classification

Castellanos, Francisco J.; Valero-Mas, Jose J.; Calvo-Zaragoza, Jorge

doi:10.1007/s00500-021-06178-2

Cited by 9 publications

(4 citation statements)

References 34 publications


“…Not only was the RHC algorithm found to be much faster than RSP3, but it was also one of the fastest approaches that took part in this experimental study [10]. A modified version of the RHC algorithm has recently been applied on string data spaces [11,12].…”

Section: Related Work (mentioning)

confidence: 95%

Fast Training Set Size Reduction Using Simple Space Partitioning Algorithms

et al. 2022


The Reduction by Space Partitioning (RSP3) algorithm is a well-known data reduction technique. It summarizes the training data and generates representative prototypes. Its goal is to reduce the computational cost of an instance-based classifier without penalty in accuracy. The algorithm keeps on dividing the initial training data into subsets until all of them become homogeneous, i.e., they contain instances of the same class. To divide a non-homogeneous subset, the algorithm computes its two furthest instances and assigns all instances to their closest furthest instance. This is a very expensive computational task, since all distances among the instances of a non-homogeneous subset must be calculated. Moreover, noise in the training data leads to a large number of small homogeneous subsets, many of which have only one instance. These instances are probably noise, but the algorithm mistakenly generates prototypes for these subsets. This paper proposes simple and fast variations of RSP3 that avoid the computationally costly partitioning tasks and remove the noisy training instances. The experimental study conducted on sixteen datasets and the corresponding statistical tests show that the proposed variations of the algorithm are much faster and achieve higher reduction rates than the conventional RSP3 without negatively affecting the accuracy.
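The partitioning loop described in this abstract can be sketched roughly as follows. This is a minimal illustration of the conventional RSP3 idea as the abstract words it, not the authors' code or their faster variants. The function name `rsp3_sketch`, the use of Euclidean distance on numeric tuples, and the choice of the subset mean as the prototype are all assumptions made for the example.

```python
import math
from itertools import combinations


def rsp3_sketch(points, labels):
    """Split the data until every subset is class-homogeneous,
    then emit one prototype (here: the subset mean, an assumption) per subset."""
    if not points:
        return []
    pending = [list(range(len(points)))]
    prototypes = []

    def mean_of(indices):
        dim = len(points[indices[0]])
        return tuple(sum(points[i][d] for i in indices) / len(indices)
                     for d in range(dim))

    while pending:
        subset = pending.pop()
        classes = {labels[i] for i in subset}
        if len(classes) == 1:
            prototypes.append((mean_of(subset), labels[subset[0]]))
            continue
        # The costly step: all pairwise distances are needed to find
        # the two furthest instances of a non-homogeneous subset.
        a, b = max(combinations(subset, 2),
                   key=lambda p: math.dist(points[p[0]], points[p[1]]))
        if math.dist(points[a], points[b]) == 0.0:
            # Identical instances with conflicting labels cannot be split;
            # emit one prototype per class so the loop terminates.
            for c in sorted(classes, key=str):
                same = [i for i in subset if labels[i] == c]
                prototypes.append((mean_of(same), c))
            continue
        # Assign each instance to the closer of the two furthest instances.
        left = [i for i in subset
                if math.dist(points[i], points[a]) <= math.dist(points[i], points[b])]
        right = [i for i in subset
                 if math.dist(points[i], points[a]) > math.dist(points[i], points[b])]
        pending.extend([left, right])
    return prototypes


demo = rsp3_sketch([(0, 0), (0, 1), (10, 0), (10, 1)], ["a", "a", "b", "b"])
print(sorted(demo))
```

On this toy input the four instances are reduced to two prototypes, one per class. A single noisy instance inside a cluster of another class would end up in its own one-element subset and still receive a prototype, which is the weakness the abstract points out.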



“…The considered multiclass PG strategies (the Chen method as well as the different RSP versions) constitute representative examples of the so-called space splitting policy [29], which typically follows a two-step approach: a first stage, space partitioning, divides the feature space of the multiclass set T_mc into different regions using certain heuristics; after that, the prototype merging stage computes new prototypes from each region attending to different criteria, producing the reduced set R_mc. The existing PG strategies under this framework, therefore, essentially differ in the particular splitting and prototype generation heuristics considered.…”

Section: Reference Multiclass PG (mentioning)

confidence: 99%

Multilabel Prototype Generation for Data Reduction in k-Nearest Neighbour classification

Valero-Mas, Gallego, Alonso-Jiménez et al. 2022

Preprint

Self Cite


Prototype Generation (PG) methods are typically considered for improving the efficiency of the k-Nearest Neighbour (kNN) classifier when tackling large corpora. Such approaches aim at generating a reduced version of the corpus without decreasing the classification performance when compared to the initial set. Despite their wide application in multiclass scenarios, very few works have addressed the proposal of PG methods for the multilabel space. In this regard, this work presents the novel adaptation of four multiclass PG strategies to the multilabel case. These proposals are evaluated with three multilabel kNN-based classifiers, 12 corpora comprising a varied range of domains and corpus sizes, and different noise scenarios artificially induced in the data. The results obtained show that the proposed adaptations are capable of significantly improving, both in terms of efficiency and classification performance, on the only reference multilabel PG work in the literature as well as on the case in which no PG method is applied, also presenting a statistically superior robustness in noisy scenarios. Moreover, these novel PG strategies allow prioritising either the efficiency or efficacy criteria through their configuration depending on the target scenario, hence covering a wide area in the solution space not previously filled by other works.


“…In [21] randomized trees, support vector machines and random forests were used for string similarity evaluation and increased accuracy was obtained on a large dataset as compared with classic approaches such as Jaro-Winkler and Damerau-Levenshtein approaches. Moreover, in [27] an unsupervised machine learning approach is used for data reduction in string space.…”

Section: String Similarities In Large Datasets (mentioning)

confidence: 99%

GPU Based Similarity Metrics Computation and Machine Learning Approaches for String Similarity Evaluation in Large Datasets

Baloi, Belean, Turcu et al. 2022

Preprint


The digital era brings, on the one hand, massive amounts of available data and, on the other hand, the need for parallel computing architectures for efficient data processing. String similarity evaluation is a processing task applied to large data volumes, commonly performed by applications such as search engines, biomedical data analysis, and even software tools for defending against viruses, spyware, or spam. String similarities are also evaluated in the music industry for matching playlist records with repertory records composed of song titles, performing artists, and producers' names, aiming to ensure copyright protection of mass-media broadcast materials. Thus, the present paper proposes a GPU-based approach for the parallel computation of the Jaro-Winkler string similarity metric. Furthermore, a thresholding-based algorithm is also implemented on the GPU for matching records over large datasets. The global GPU memory is used to store multiple string lines as raw data. In the case of a single string, its comparisons with the raw data are performed using the maximum number of available GPU threads and stride operations. Moreover, based on the computed similarity metrics, an adaptive neural network approach guided by a novelty detection classifier, together with a naive neural network implementation, is proposed to increase the accuracy of the record matching procedure. Timing considerations and computational complexity are detailed for the proposed approaches and compared with state-of-the-art CPU and GPU approaches. A speed-up factor of 21.6 was obtained for the GPU-based Jaro-Winkler implementation compared with the general-purpose processor one, whereas improved accuracy for the record matching procedure was delivered using machine learning approaches.
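For readers unfamiliar with the metric named in this abstract, a plain CPU reference sketch of Jaro-Winkler similarity with threshold-based record matching might look like the following. It is not the paper's GPU implementation. The helper names `jaro`, `jaro_winkler`, and `match_records`, the prefix scale of 0.1, the prefix cap of 4 characters, and the 0.9 threshold are conventional or illustrative choices assumed here, not values taken from the paper.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        lo = max(0, i - window)
        hi = min(i + window + 1, len2)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (assumed default parameters)."""
    j = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:max_prefix], s2[:max_prefix]):
        if c1 != c2:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


def match_records(query, records, threshold=0.9):
    """Keep the records whose similarity to the query reaches the threshold."""
    return [r for r in records if jaro_winkler(query, r) >= threshold]


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
print(match_records("MARTHA", ["MARHTA", "DIXON", "MARTHA"]))
```

Each comparison of the query against a record is independent of the others, which is why the workload parallelises naturally across GPU threads as the abstract describes.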

