Tandem connectionist feature extraction for conventional HMM systems

Heřmanský, Hynek; Ellis, Dan; Sharma, Sanjay

doi:10.1109/icassp.2000.862024

Cited by 529 publications

(414 citation statements)

References 10 publications

Supporting

Mentioning

399

Contrasting

Unclassified

“…It is worth mentioning that KL-HMM was originally developed from the perspective of acoustic modeling, as an alternative to Tandem approach (Hermansky et al, 2000). However, as shown recently and briefly explained in this section, KL-HMM is a probabilistic modeling approach (Rasipuram and Magimai.-Doss, 2013a,b).…”

Section: Kullback-Leibler Divergence Based HMM (mentioning)

confidence: 99%

“…We investigate two systems, the first system uses standard cepstral features as feature observations (HMM/GMM system) and the second system uses Tandem features as feature observations (Tandem system) (Hermansky et al, 2000). As indicated in Table 1, Tandem system exploits both language-dependent and language-independent resources similar to probabilistic lexical model based systems and acoustic model adaptation based systems.…”

Section: Standard Language-dependent Acoustic Model and Lexical Model (mentioning)

confidence: 99%

Acoustic and lexical resource constrained ASR using language-independent acoustic model and language-dependent probabilistic lexical model

Rasipuram

Magimai-Doss

2015

Speech Communication

One of the key challenges involved in building a statistical automatic speech recognition (ASR) system is modeling the relationship between lexical units (that are based on subword units in the pronunciation lexicon) and acoustic feature observations. To model this relationship two types of resources are needed, namely, acoustic resources (speech signals with word level transcriptions) and lexical resources (which transcribe each word in terms of subword units). Standard ASR systems typically use phonemes or phones as subword units. Not all languages have well developed acoustic resources and phonetic lexical resources. In this paper, we show that modeling of the relationship between lexical units and acoustic features can be factored into two parts through a latent variable, referred to as acoustic units, namely: (a) an acoustic model that models the relationship between acoustic features and acoustic units and (b) a lexical model that models the relationship between lexical units and acoustic units. Through this understanding, we elucidate that in a standard hidden Markov model (HMM) based ASR system, the lexical model is deterministic (i.e., there exists a one-to-one relationship between lexical units and acoustic units), and it is the deterministic lexical model that imposes the need for well developed acoustic and lexical resources in the target language or domain when building an ASR system. We then propose an approach that addresses both acoustic resource and lexical resource constraints. More specifically, in the proposed approach the acoustic model models the relationship between acoustic features and multilingual phones (acoustic units) on target language-independent data, and the lexical model models a probabilistic relationship between lexical units based on graphemes and multilingual phones on a small amount of target language data. We show the potential and the efficacy of the proposed approach through experiments and comparisons with other approaches on three different ASR tasks, namely, non-native accented speech recognition, rapid development of an ASR system for a new language and development of an ASR system for a minority language.
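
The factoring described in this abstract can be illustrated with a minimal sketch. It assumes the latent acoustic units are marginalized out, so that the likelihood of a frame given a lexical unit is a weighted sum over acoustic units; that formula, the function name and all numbers below are illustrative assumptions, not taken from the paper.

```python
from typing import List


def lexical_unit_likelihood(acoustic_likelihoods: List[float],
                            lexical_model_row: List[float]) -> float:
    """Assumed form: p(x | l) = sum_d p(x | a_d) * P(a_d | l).

    acoustic_likelihoods: p(x | a_d) for each acoustic unit a_d (acoustic model).
    lexical_model_row:    P(a_d | l) for one lexical unit l (lexical model).
    """
    if len(acoustic_likelihoods) != len(lexical_model_row):
        raise ValueError("both inputs need one entry per acoustic unit")
    return sum(p * w for p, w in zip(acoustic_likelihoods, lexical_model_row))


# Hypothetical acoustic-model scores for one frame over three acoustic units.
acoustic = [0.2, 0.5, 0.1]

# Deterministic lexical model: the lexical unit maps one-to-one to unit 1.
deterministic = [0.0, 1.0, 0.0]

# Probabilistic lexical model: the lexical unit spreads mass over all units.
probabilistic = [0.3, 0.6, 0.1]

det_score = lexical_unit_likelihood(acoustic, deterministic)
prob_score = lexical_unit_likelihood(acoustic, probabilistic)
```

With a one-hot row the sum collapses to a single acoustic-model score, which is the deterministic case the abstract attributes to standard HMM systems; a soft row lets a grapheme-based lexical unit draw on several multilingual phones.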

“…Deep Boltzmann machines have been used as stacked autoencoders for feature extraction [20] or post-processing of local binary patterns from three orthogonal planes (LBP-TOP) [25]. These features are then classified using Support Vector Machines (SVMs) [20], where all utterance lengths have to be normalized, or using a tandem system [13], where the features are passed into a GMM-HMM recognizer [15,21,25]. Similarly, feature extraction has been performed by convolutional neural networks (CNNs) [21,16] and deep belief networks (DBNs) [15].…”

Section: Related Work (mentioning)

confidence: 99%

Visual Speech Recognition Using PCA Networks and LSTMs in a Tandem GMM-HMM System

Zimmermann

Ghazi

Ekenel

et al. 2017

Computer Vision – ACCV 2016 Workshops

Abstract. Automatic visual speech recognition is an interesting problem in pattern recognition especially when audio data is noisy or not readily available. It is also a very challenging task mainly because of the lower amount of information in the visual articulations compared to the audible utterance. In this work, principal component analysis is applied to the image patches, extracted from the video data, to learn the weights of a two-stage convolutional network. Block histograms are then extracted as the unsupervised learning features. These features are employed to learn a recurrent neural network with a set of long short-term memory cells to obtain spatiotemporal features. Finally, the obtained features are used in a tandem GMM-HMM system for speech recognition. Our results show that the proposed method has outperformed the baseline techniques applied to the OuluVS2 audiovisual database for phrase recognition with the frontal view cross-validation and testing sentence correctness reaching 79% and 73%, respectively, as compared to the baseline of 74% on cross-validation. The final publication is available at Springer via http://dx

“…For PLP cepstral features, usually 9 frames of PLP coefficients and their first and second order derivatives are concatenated as the input for a trained MLP to estimate the posterior probabilities of context-independent phones [5]. The phonetic class is defined with respect to the center of 9 frames.…”

Section: Single Stream Posterior Estimation (mentioning)

confidence: 99%
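
The 9-frame input window described in the statement above can be sketched as follows. This is a minimal illustration, not the cited system's code: the function name is hypothetical, edge frames are padded by repetition (the quoted text does not say how edges are handled), and the toy vectors are far smaller than real PLP features with derivatives.

```python
from typing import List


def stack_context(frames: List[List[float]], context: int = 4) -> List[List[float]]:
    """Concatenate each frame with `context` neighbours on each side.

    With context=4 every output vector spans 9 frames and is centred on the
    frame whose phone class the MLP would be asked to predict.
    Assumption: edges are padded by repeating the first or last frame.
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        window: List[float] = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp index to the utterance
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Toy input: 5 frames of 3-dimensional features.
frames = [[float(t), float(t) + 0.1, float(t) + 0.2] for t in range(5)]
stacked = stack_context(frames, context=4)
```

Each stacked vector has 9 times the per-frame dimension, and the centre frame sits in the middle block of the window, matching the statement that the phonetic class is defined with respect to the center of the 9 frames.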

“…This hierarchical approach provides a new, principled, theoretical framework for combining different streams of features taking into account context and model knowledge. We show that this method gives significant performance improvement over baseline PLP-TANDEM [5] and TRAP-TANDEM [11] techniques and also entropy based combination method [12] on OGI digits [13] and a reduced vocabulary version (1000 words) of CTS [6] databases.…”

Section: Introduction (mentioning)

confidence: 99%

Hierarchical Multi-stream Posterior Based Speech Recognition System

Ketabdar

Bourlard

Bengio

2006

Machine Learning for Multimodal Interaction

Abstract. In this paper, we present initial results towards boosting posterior based speech recognition systems by estimating more informative posteriors using multiple streams of features and taking into account acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). These posteriors are estimated based on "state gamma posterior" definition (typically used in standard HMMs training) extended to the case of multi-stream HMMs. This approach provides a new, principled, theoretical framework for hierarchical estimation/use of posteriors, multi-stream feature combination, and integrating appropriate context and prior knowledge in posterior estimates. In the present work, we used the resulting gamma posteriors as features for a standard HMM/GMM layer. On the OGI Digits database and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task, this resulted in significant performance improvement, compared to the state-of-the-art Tandem systems.
