We analyze a simple hierarchical architecture consisting of two multilayer perceptron (MLP) classifiers in tandem to estimate the phonetic class conditional probabilities. In this hierarchical setup, the first MLP classifier is trained using standard acoustic features. The second MLP is trained using the posterior probabilities of phonemes estimated by the first, but with a long temporal context of around 150-230 ms. Through extensive phoneme recognition experiments, and the analysis of the trained second MLP using Volterra series, we show that (a) the hierarchical system yields higher phoneme recognition accuracies (an absolute improvement of 3.5% and 9.3% on TIMIT and CTS, respectively) over the conventional single-MLP-based system, (b) there exists useful information in the temporal trajectories of the posterior feature space, spanning around 230 ms of context, (c) the second MLP learns the phonetic temporal patterns in the posterior features, which include the phonetic confusions at the output of the first MLP as well as the phonotactics of the language as observed in the training data, and (d) the second MLP classifier requires fewer parameters and can be trained with less training data.
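A minimal sketch of this two-stage hierarchy in PyTorch is given below; the layer sizes, the 9-frame acoustic context for the first MLP, and the 23-frame posterior window (~230 ms at an assumed 10 ms frame shift) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

NUM_PHONEMES = 40        # assumed phoneme set size
NUM_CEPSTRA = 39         # e.g. cepstra plus deltas; illustrative
CONTEXT = 23             # ~230 ms of posteriors at a 10 ms frame shift

# First MLP: standard acoustic features -> phoneme posteriors per frame.
mlp1 = nn.Sequential(
    nn.Linear(NUM_CEPSTRA * 9, 1000),   # 9-frame acoustic context (assumed)
    nn.Sigmoid(),
    nn.Linear(1000, NUM_PHONEMES),
    nn.Softmax(dim=-1),
)

# Second MLP: a long temporal window of posteriors -> refined posteriors.
mlp2 = nn.Sequential(
    nn.Linear(NUM_PHONEMES * CONTEXT, 500),
    nn.Sigmoid(),
    nn.Linear(500, NUM_PHONEMES),
    nn.Softmax(dim=-1),
)

def hierarchical_posteriors(acoustic_windows):
    """acoustic_windows: (T, NUM_CEPSTRA * 9) frame-wise feature windows."""
    post = mlp1(acoustic_windows)                     # (T, NUM_PHONEMES)
    # Stack a CONTEXT-frame window of posteriors around each frame,
    # zero-padding the utterance edges.
    pad = CONTEXT // 2
    padded = nn.functional.pad(post.T, (pad, pad)).T  # (T + 2*pad, NUM_PHONEMES)
    windows = padded.unfold(0, CONTEXT, 1)            # (T, NUM_PHONEMES, CONTEXT)
    return mlp2(windows.reshape(windows.shape[0], -1))
```

In practice each MLP would be trained with a frame-level cross-entropy criterion against phoneme labels; the sketch only shows the forward pass.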
Automatic speech recognition systems typically model the relationship between the acoustic speech signal and the phones in two separate steps: feature extraction and classifier training. In our recent works, we have shown that, in the framework of convolutional neural networks (CNN), the relationship between the raw speech signal and the phones can be directly modeled, and ASR systems competitive with the standard approach can be built. In this paper, we first analyze and show that, between the first two convolutional layers, the CNN learns (in part) and models the phone-specific spectral envelope information of 2-4 ms of speech. Given that, we show that the CNN-based approach yields ASR trends similar to a standard short-term spectral feature based ASR system under mismatched (noisy) conditions, with the CNN-based approach being more robust.
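As a rough illustration of the kind of analysis this implies, one can inspect the magnitude responses of a trained first convolution layer and look for phone-specific spectral envelope structure; the filter count, kernel size (~2 ms at 16 kHz), and stride below are assumptions.

```python
import numpy as np
import torch.nn as nn

FS = 16000                     # sampling rate in Hz (assumed)
KERNEL = 30                    # ~2 ms of raw speech at 16 kHz (assumed)

# First convolution layer of a raw-speech CNN; in the analysis these
# weights would come from a trained model, and the 80 filters and
# 10-sample shift are illustrative choices.
conv1 = nn.Conv1d(in_channels=1, out_channels=80, kernel_size=KERNEL, stride=10)

def filter_responses(conv, n_fft=512, fs=FS):
    """Magnitude response of each learned filter; the analysis compares
    the peaks of these responses against spectral envelope structure."""
    w = conv.weight.detach().numpy()[:, 0, :]       # (80, KERNEL)
    mag = np.abs(np.fft.rfft(w, n=n_fft, axis=1))   # (80, n_fft//2 + 1)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)      # bin centers in Hz
    return freqs, mag
```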
In hidden Markov model (HMM) based automatic speech recognition (ASR) systems, modeling the statistical relationship between the acoustic speech signal and the HMM states that represent linguistically motivated subword units such as phonemes is a crucial step. This is typically achieved by first extracting acoustic features from the speech signal based on prior knowledge such as speech perception and/or speech production knowledge, and then training a classifier, such as an artificial neural network (ANN) or a Gaussian mixture model, that estimates the emission probabilities of the HMM states. Recent advances in machine learning, more specifically in the fields of image processing and text processing, have shown that such a divide-and-conquer strategy (i.e., separating the feature extraction and modeling steps) may not be necessary. Motivated by these studies, we propose an end-to-end acoustic modeling approach using convolutional neural networks (CNNs), where the CNN takes the raw speech signal as input and estimates the HMM state class conditional probabilities at the output. In other words, in this approach the relevant features and the classifier are jointly learned from the raw speech signal. Through ASR studies and analyses on multiple languages and multiple tasks, we show that: (a) the proposed approach consistently yields a better system with fewer parameters when compared to the conventional approach of cepstral feature extraction followed by ANN training, (b) unlike conventional methods of speech processing, in the proposed approach the relevant feature representations are learned by first processing the input raw speech at the sub-segmental level (≈ 2 ms); specifically, through an analysis we show that the filters in the first convolution layer automatically learn "in parts" formant-like information present in the sub-segmental speech, and (c) the intermediate feature representations obtained by subsequent filtering of the first convolution layer output are more discriminative than standard cepstral features and can be transferred across languages and domains.
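A sketch of such an end-to-end pipeline follows; the filter counts, kernel sizes, input window length, and a 3-states-per-phoneme output size are chosen purely for illustration and are not the architecture reported in the paper.

```python
import torch
import torch.nn as nn

NUM_HMM_STATES = 3 * 40   # e.g. 3 states for each of 40 phonemes (assumed)

# A minimal raw-waveform CNN acoustic model: convolution/pooling stages
# learn the feature representation, and a final linear layer acts as the
# classifier producing scores for the HMM states.
model = nn.Sequential(
    nn.Conv1d(1, 80, kernel_size=30, stride=10),  # sub-segmental (~2 ms) filters
    nn.MaxPool1d(3),
    nn.ReLU(),
    nn.Conv1d(80, 60, kernel_size=7),
    nn.MaxPool1d(3),
    nn.ReLU(),
    nn.Conv1d(60, 60, kernel_size=7),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                      # collapse time within the window
    nn.Flatten(),
    nn.Linear(60, NUM_HMM_STATES),
)

# One batch of 250 ms raw-speech windows at 16 kHz (assumed geometry):
x = torch.randn(8, 1, 4000)                       # (batch, channel, samples)
state_scores = model(x)                           # (8, NUM_HMM_STATES)
```

Softmax over the output scores gives the HMM state class conditional probabilities, which would be converted to scaled likelihoods for HMM decoding in the usual hybrid fashion.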
The so-called tandem approach, where the posteriors of a multilayer perceptron (MLP) classifier are used as features in an automatic speech recognition (ASR) system, has proven to be a very effective method. Most tandem approaches to date have relied on MLPs trained for phone classification, and appended the posterior features to standard features in a hidden Markov model (HMM) based system. In this paper, we develop an alternative tandem approach based on MLPs trained for articulatory feature (AF) classification. We also develop a factored observation model for characterizing the posterior and standard features at the HMM outputs, allowing for separate hidden mixture and state-tying structures for each factor. In experiments on a subset of Switchboard, we show that the AF-based tandem approach is as effective as the phone-based approach, and that the factored observation model significantly outperforms the simple feature concatenation approach while using fewer parameters.
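The factored observation model can be sketched as follows, assuming diagonal-covariance Gaussian mixture factors: each state scores the standard features and the posterior features with separate mixtures and sums the factor log-likelihoods, instead of fitting a single mixture to the concatenated vector. The parameter names and dictionary layout are illustrative.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM.
    weights: (M,), means/variances: (M, D)."""
    diff = x - means
    exponents = -0.5 * np.sum(diff**2 / variances
                              + np.log(2 * np.pi * variances), axis=1)
    return np.logaddexp.reduce(np.log(weights) + exponents)

def factored_state_loglik(std_feats, post_feats, state):
    """Factored observation model: separate mixtures per factor, combined
    by summing log-likelihoods (conditional independence given the state).
    `state` holds one GMM parameter tuple per factor, e.g.
    state = {"std_gmm": (w1, mu1, var1), "post_gmm": (w2, mu2, var2)}."""
    return (gmm_loglik(std_feats, *state["std_gmm"])
            + gmm_loglik(post_feats, *state["post_gmm"]))
```

The factorization is what allows each factor its own mixture and state-tying structure, rather than tying both feature streams together.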
In this paper, we investigate the significance of contextual information in a phoneme recognition system using the hidden Markov model / artificial neural network (HMM/ANN) paradigm. Contextual information is probed at the feature level as well as at the output of the multilayered perceptron. At the feature level, we analyse and compare different methods to model sub-phonemic classes. To exploit the contextual information at the output of the multilayered perceptron, we propose the hierarchical estimation of phoneme posterior probabilities. The best phoneme (excluding silence) recognition accuracy of 73.4% on the TIMIT database is comparable to that of state-of-the-art systems, but our emphasis is on the analysis of the contextual information.
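One simple way to derive sub-phonemic class targets of the kind compared above is to split every labeled phoneme segment uniformly into begin/middle/end classes; the uniform 3-way split below is an assumption for illustration, not necessarily the scheme used in the paper.

```python
import numpy as np

def subphonemic_targets(frame_labels):
    """Map a frame-level phoneme label sequence to sub-phonemic classes by
    uniformly splitting every contiguous segment into 3 parts, so that
    phoneme k yields classes 3k (begin), 3k+1 (middle), 3k+2 (end)."""
    frame_labels = np.asarray(frame_labels)
    targets = np.empty_like(frame_labels)
    start = 0
    for t in range(1, len(frame_labels) + 1):
        # Close a segment at the sequence end or at a label change.
        if t == len(frame_labels) or frame_labels[t] != frame_labels[start]:
            length = t - start
            for i in range(start, t):
                part = min(3 * (i - start) // length, 2)
                targets[i] = 3 * frame_labels[start] + part
            start = t
    return targets

# Phoneme 5 held for 6 frames -> [15, 15, 16, 16, 17, 17]
print(subphonemic_targets([5, 5, 5, 5, 5, 5]))
```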
One of the key challenges involved in building a statistical automatic speech recognition (ASR) system is modeling the relationship between lexical units (that are based on subword units in the pronunciation lexicon) and acoustic feature observations. To model this relationship, two types of resources are needed, namely, acoustic resources (speech signals with word-level transcriptions) and lexical resources (which transcribe each word in terms of subword units). Standard ASR systems typically use phonemes or phones as subword units. Not all languages have well-developed acoustic resources and phonetic lexical resources. In this paper, we show that modeling of the relationship between lexical units and acoustic features can be factored into two parts through a latent variable, referred to as acoustic units, namely: (a) an acoustic model that models the relationship between acoustic features and acoustic units and (b) a lexical model that models the relationship between lexical units and acoustic units. Through this understanding, we elucidate that in a standard hidden Markov model (HMM) based ASR system, the lexical model is deterministic (i.e., there exists a one-to-one relationship between lexical units and acoustic units), and it is the deterministic lexical model that imposes the need for well-developed acoustic and lexical resources in the target language or domain when building an ASR system. We then propose an approach that addresses both acoustic resource and lexical resource constraints. More specifically, in the proposed approach the acoustic model models the relationship between acoustic features and multilingual phones (acoustic units) on target-language-independent data, and the lexical model models a probabilistic relationship between lexical units based on graphemes and multilingual phones on a small amount of target language data. We show the potential and the efficacy of the proposed approach through experiments and comparisons with other approaches on three different ASR tasks, namely, non-native accented speech recognition, rapid development of an ASR system for a new language, and development of an ASR system for a minority language.
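The factorization can be sketched as follows, with illustrative unit counts: a lexical-model matrix p(acoustic unit | lexical unit) marginalizes over the latent acoustic units, and the deterministic lexical model of a standard HMM is recovered when each row of that matrix is one-hot.

```python
import numpy as np

NUM_ACOUSTIC_UNITS = 120   # e.g. multilingual phones (assumed)
NUM_LEXICAL_UNITS = 28     # e.g. target-language graphemes (assumed)

# Lexical model: p(acoustic unit a | lexical unit l), which would be
# learned on a small amount of target-language data; random rows here
# just stand in for trained distributions.
lexical_model = np.random.dirichlet(np.ones(NUM_ACOUSTIC_UNITS),
                                    size=NUM_LEXICAL_UNITS)

def lexical_unit_scores(acoustic_unit_likelihoods):
    """Marginalize the latent acoustic units:
        p(x | l) = sum_a p(x | a) * p(a | l).
    acoustic_unit_likelihoods: (NUM_ACOUSTIC_UNITS,) scores from the
    acoustic model for one observation x. A deterministic lexical model
    is the special case where each row of `lexical_model` is one-hot."""
    return lexical_model @ acoustic_unit_likelihoods   # (NUM_LEXICAL_UNITS,)
```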
Detection and localization of speakers with microphone arrays is a difficult task due to the wideband nature of speech signals, the large amount of overlap between speakers in spontaneous conversations, and the presence of noise sources. Many existing audio multi-source localization methods rely on prior knowledge of the sectors containing active sources and/or the number of active sources. This paper proposes sector-based, frequency-domain approaches that address both the detection and localization problems by measuring relative phases between microphones. The first approach is similar to delay-sum beamforming. The second approach is novel: it relies on systematic optimization of a centroid in phase space, for each sector. It provides a major, systematic improvement over the first approach as well as over previous work. Very good results are obtained on more than one hour of recordings in real meeting room conditions, including cases with up to 3 concurrent speakers.
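The first, delay-sum-like approach can be sketched as follows: in the frequency domain, each microphone's spectrum is phase-aligned according to the delays that a candidate point inside the sector would produce, and the coherently summed energy scores the sector. The array geometry, candidate grid, and PHAT-like normalization below are illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s

def sector_activity(spectra, mic_pos, candidate_points, freqs):
    """Delay-sum-style score for one sector.
    spectra:          (M, F) complex STFT of one frame, one row per mic
    mic_pos:          (M, 3) microphone coordinates in metres
    candidate_points: (P, 3) grid of points sampling the sector
    freqs:            (F,)  frequency of each bin in Hz
    Returns the best steered-response score over the sector's grid."""
    scores = []
    for p in candidate_points:
        delays = np.linalg.norm(mic_pos - p, axis=1) / SPEED_OF_SOUND  # (M,)
        # Compensate the propagation phase each mic would see from p.
        steering = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
        aligned = spectra * steering
        # Phase-only (PHAT-like) normalization keeps the wideband
        # speech bins on an equal footing.
        aligned /= np.abs(aligned) + 1e-12
        scores.append(np.abs(aligned.sum(axis=0)).sum())
    return max(scores)
```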