Despite their known weaknesses, hidden Markov models (HMMs) have been the dominant technique for acoustic modeling in speech recognition for over two decades. Still, advances within the HMM framework have not solved its key problems: it discards information about time dependencies and is prone to overgeneralization. In this paper, we attempt to overcome these problems by relying on straightforward template matching. The basis for the recognizer is the well-known DTW algorithm. However, classical DTW continuous speech recognition results in an explosion of the search space. The traditional top-down search is therefore complemented with a data-driven selection of candidates for DTW alignment. We also extend the DTW framework with a flexible subword unit mechanism and a class-sensitive distance measure, two components suggested by state-of-the-art HMM systems. The added flexibility of unit selection in the template-based framework leads to new approaches to speaker and environment adaptation. The template-matching system performs somewhat worse than the best published HMM results on the Resource Management benchmark, but thanks to the complementarity of errors between the HMM and DTW systems, combining the two reduces the word error rate by 17% relative to the HMM results.
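As a minimal sketch of the alignment machinery referred to above (not the authors' implementation), classic DTW between two frame sequences can be written as follows; the function name and the Euclidean local cost are illustrative choices:

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic time warping between two feature sequences.

    x, y: 2-D arrays of shape (frames, dims). The local cost is the
    Euclidean distance between frames (an illustrative choice).
    """
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            # symmetric step pattern: diagonal match, insertion, deletion
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]
```

Because the warping path may repeat frames, a sequence and a time-stretched copy of it align with zero cost, which is exactly the invariance that makes DTW attractive for template matching.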
State-of-the-art speech recognition relies on a state-dependent distance measure. In HMM systems, the distance measure is trained into state-dependent covariance matrices using a maximum likelihood or discriminative criterion. This "automatic" adjustment of the distance measure is traditionally considered an inherent advantage of HMMs over DTW recognizers, which typically rely on a uniform Euclidean distance. In this paper we show how to incorporate a non-uniformly weighted distance measure into an example-based recognition system. By doing so we combine the superior segmental behaviour of DTW with the near-optimal acoustic distance measure found in HMMs. The non-uniform distance measure forces modifications to the k-nearest-neighbours search, an essential component in our large-vocabulary DTW approach. We show that the complexity of our solution remains within bounds. The validity of the full approach is verified by experimental results on the Resource Management and TIDigits tasks.
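The core idea of a class-dependent distance inside k-NN search can be sketched as below. This is an illustrative reading, not the paper's algorithm: each reference frame carries a class label, and the inverse covariance of that class replaces the single Euclidean metric when scoring candidates.

```python
import numpy as np

def weighted_knn(frame, refs, labels, inv_covs, k=5):
    """k nearest reference frames under a class-dependent distance.

    frame:    1-D input feature vector
    refs:     2-D array of reference frames
    labels:   class label of each reference frame
    inv_covs: mapping class -> inverse covariance matrix of that class
    """
    dists = np.empty(len(refs))
    for i, (r, c) in enumerate(zip(refs, labels)):
        diff = frame - r
        # squared Mahalanobis distance under the reference's own class
        dists[i] = diff @ inv_covs[c] @ diff
    order = np.argsort(dists)[:k]
    return order, dists[order]
```

With identity matrices for every class this reduces to plain Euclidean k-NN; the complexity issue the abstract mentions arises because the metric now changes per candidate, so standard metric-space pruning tricks no longer apply directly.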
Example-based speech recognition is critically dependent on the quality of the acoustic distance measure between input and reference vectors. In the past, the commonly used Euclidean distance has been refined to take into account the covariance of the different sounds, resulting in a class-dependent distance measure. However, using the same measure for the whole class is still too crude: vectors in the tails of the distribution (outliers) are unduly treated as being as representative of the class as those near its centre. In this paper, we derive two techniques inspired by non-parametric density estimation that explicitly adjust the distance measure based on the position of the reference vector within its class. Experiments on three low-level acoustic tasks show that "data sharpening" yields a substantial improvement, while "adaptive kernels" have minimal effect.
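One common non-parametric formulation of data sharpening moves each reference vector toward the dense part of its distribution, e.g. by replacing it with the mean of its k nearest neighbours. The sketch below illustrates that idea only; the paper's exact variant may differ, and the function name is hypothetical.

```python
import numpy as np

def sharpen(refs, k=3):
    """Data sharpening (illustrative variant): replace each reference
    vector by the mean of its k nearest neighbours (itself included),
    so that tail vectors drift toward the dense centre of the data."""
    refs = np.asarray(refs, dtype=float)
    # pairwise squared Euclidean distances between all reference vectors
    d2 = ((refs[:, None, :] - refs[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]  # column 0 is the vector itself
    return refs[nn].mean(axis=1)
```

On a tight cluster plus one outlier, the cluster barely moves while the outlier is pulled noticeably toward it, which is precisely the outlier de-emphasis the abstract describes.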
The dominant acoustic modeling methodology based on hidden Markov models is known to have certain weaknesses. Partial solutions to these flaws have been presented, but the fundamental problem remains: compressing the data into a compact HMM discards useful information such as time dependencies and speaker information. In this paper, we look at pure example-based recognition as a solution to this problem. By replacing the HMM with the underlying examples, all information in the training data is retained. We show how information about speaker and environment can be used, introducing a new interpretation of adaptation. The basis for the recognizer is the well-known DTW algorithm, which has often been used for small tasks. However, large-vocabulary speech recognition introduces new demands, resulting in an explosion of the search space. We show how this problem can be tackled with a data-driven approach that selects appropriate speech examples as candidates for DTW alignment.
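One plausible shape for such data-driven candidate selection (a sketch under our own assumptions, not the paper's method) is a frame-level k-NN vote: each input frame retrieves its nearest reference frames, and only the templates that collect many votes are passed on to full DTW alignment.

```python
import numpy as np
from collections import Counter

def select_candidates(input_frames, ref_frames, ref_template_ids,
                      k=5, n_candidates=3):
    """Data-driven preselection (illustrative): for each input frame,
    find its k nearest reference frames and vote for the templates they
    belong to. Only the top-voted templates are aligned with full DTW,
    which keeps the search space manageable."""
    votes = Counter()
    for f in input_frames:
        d = ((ref_frames - f) ** 2).sum(axis=1)  # squared Euclidean
        for idx in np.argsort(d)[:k]:
            votes[ref_template_ids[idx]] += 1
    return [t for t, _ in votes.most_common(n_candidates)]
```

The preselection is cheap (frame-level nearest-neighbour lookups) compared with the quadratic cost of DTW against every template, which is where the search-space savings come from.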
In this paper we investigate the behaviour of different acoustic distance measures for template-based speech recognition in light of the combination of acoustic distances, linguistic knowledge and template-concatenation fluency costs. To that end, different acoustic distance measures are compared on tasks with varying levels of fluency and linguistic constraints. We show that adopting those constraints invariably causes an acoustically clearly suboptimal template sequence to be chosen as the winning hypothesis. This has strong implications for the design of acoustic distance measures: measures that are optimal for frame-based classification may prove suboptimal for full-sentence recognition. In particular, we show that this is the case when comparing the Euclidean and the recently introduced adaptive-kernel local Mahalanobis distance measures.