Computing the Expected Edit Distance from a String to a Probabilistic Finite-State Automaton

Calvo-Zaragoza, Jorge; Oncina, José; Higuera, Colin de la

doi:10.1142/s0129054117400093

Cited by 3 publications

(7 citation statements)

References 21 publications


“…In this adaptation, the prototype generation stage of the RHC algorithm required the computation of the median value of a set of strings. Given that this computation is an NP-hard problem (Calvo-Zaragoza et al 2017a), the median computation was tackled by considering a set-median strategy instead.…”

Section: Background In Data Reduction (mentioning)

confidence: 99%

“…In spite of its conceptual simplicity, the computation of the median value in the string domain still constitutes an open research question owing to the fact that it is an NP-complete problem (Calvo-Zaragoza et al 2017a). This signifies that while works such as that of Kruskal (1983) propose strategies for the exact median calculation of this median value in the string domain, its applicability is severely conditioned by its extremely low efficiency.…”

Section: Background In Data Reduction (mentioning)

confidence: 99%

“…This approach is based on recursively dividing the initial corpus into homogeneous clusters in order to then replace each of them with a representative prototype generated as the median element of the cluster. However, as the computation of the exact median element from a set of string data is known in the literature as an NP-complete problem (Calvo-Zaragoza et al 2017a), the work resorted to the use of the set-median value of each cluster, i.e., selecting that median string which minimizes the sum of the distances to the remaining elements in the set.…”

Section: Introduction (mentioning)

confidence: 99%


Prototype generation in the string space via approximate median for data reduction in nearest neighbor classification

2021

Self Cite


The k-nearest neighbor (kNN) rule is one of the best-known distance-based classifiers, and is usually associated with high performance and versatility as it requires only the definition of a dissimilarity measure. Nevertheless, kNN is also coupled with low-efficiency levels since, for each new query, the algorithm must carry out an exhaustive search of the training data, and this drawback is much more relevant when considering complex structural representations, such as graphs, trees or strings, owing to the cost of the dissimilarity metrics. This issue has generally been tackled through the use of data reduction (DR) techniques, which reduce the size of the reference set, but the complexity of structural data has historically limited their application in the aforementioned scenarios. A DR algorithm denominated as reduction through homogeneous clusters (RHC) has recently been adapted to string representations but as obtaining the exact median value of a set of string data is known to be computationally difficult, its authors resorted to computing the set-median value. Under the premise that a more exact median value may be beneficial in this context, we, therefore, present a new adaptation of the RHC algorithm for string data, in which an approximate median computation is carried out. The results obtained show significant improvements when compared to those of the set-median version of the algorithm, in terms of both classification performance and reduction rates.
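The set-median mentioned in the abstract and in the citation statements above can be illustrated with a short sketch. It assumes unit-cost Levenshtein distance as the dissimilarity measure; the function names `edit_distance` and `set_median` are hypothetical and are not taken from the cited works. The set-median restricts the search to the strings already in the set, which is what makes it cheap compared with the exact median string.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed one DP row at a time."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i symbols of `a`
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def set_median(strings: Sequence[str]) -> str:
    """Return the member of `strings` with the smallest summed distance to the set."""
    if not strings:
        raise ValueError("set_median requires at least one string")
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))
```

For a cluster of n strings this evaluates on the order of n squared pairwise distances, and the result is always one of the input strings, so it may be a poorer representative than a median drawn from the whole string space.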



“…As aforementioned, this algorithm replaces same-class subsets of prototypes by new elements generated by estimating their median value. Thus, the main issue to tackle is the actual retrieval of the median value of a group of strings, which in our case we resort to the set median as the calculus of the exact median string constitutes an NP-hard problem [18]. Additionally, in order to compare the performance of RHC strategy in both statistical and structural spaces, we make use of the Dissimilarity Space (DS) technique [19] to map the initial strings representation onto a feature-based codification so that additional conclusions can be gathered.…”

Section: Introduction (mentioning)

confidence: 99%

“…The main consideration in this design is the actual computation of the median string. As it has been introduced, the retrieval of the exact median value of a set of strings is known to be a NP-hard problem [18]. Thus, in this case we consider the set-median operation due to its lower complexity.…”

mentioning

confidence: 99%

Data Reduction in the String Space for Efficient kNN Classification Through Space Partitioning

Valero-Mas

Castellanos

2020

Applied Sciences


Within the Pattern Recognition field, two representations are generally considered for encoding the data: statistical codifications, which describe elements as feature vectors, and structural representations, which encode elements as high-level symbolic data structures such as strings, trees or graphs. While the vast majority of classifiers are capable of addressing statistical spaces, only some particular methods are suitable for structural representations. The kNN classifier constitutes one of the scarce examples of algorithms capable of tackling both statistical and structural spaces. This method is based on the computation of the dissimilarity between all the samples of the set, which is the main reason for its high versatility, but in turn, for its low efficiency as well. Prototype Generation is one of the possibilities for palliating this issue. These mechanisms generate a reduced version of the initial dataset by performing data transformation and aggregation processes on the initial collection. Nevertheless, these generation processes are quite dependent on the data representation considered, being not generally well defined for structural data. In this work we present the adaptation of the generation-based reduction algorithm Reduction through Homogeneous Clusters to the case of string data. This algorithm performs the reduction by partitioning the space into class-homogeneous clusters for then generating a representative prototype as the median value of each group. Thus, the main issue to tackle is the retrieval of the median element of a set of strings. Our comprehensive experimentation comparatively assesses the performance of this algorithm in both the statistical and the string-based spaces. Results prove the relevance of our approach by showing a competitive compromise between classification rate and data reduction.


Bounds and Estimates on the Average Edit Distance

Schimd

Bilardi

2019

String Processing and Information Retrieval

