Speech recognizer-based microphone array processing for robust hands-free speech recognition

Seltzer, Michael L.; Raj, Bhiksha; Stern, Richard M.

doi:10.1109/icassp.2002.5743884

Cited by 33 publications

(26 citation statements)

References 5 publications

(4 reference statements)


“…Unfortunately, it requires knowing the clean speech, which is not available in most practical applications. Further improvements in recognition accuracy can be obtained at lower signal-to-noise ratios by the use of multiple microphones (Silverman et al., 1997) (Seltzer, 2003).…”

Section: Classical Proposed Solutions In Speech Recognition Robustness (mentioning)

confidence: 99%

Evolutionary Speech Recognition

Spalanzani

2007

Robust Speech Recognition and Understanding


“…It can be shown [11] that when the HMM state distributions are modeled as mixtures of Gaussians, the gradient expression can be expressed as (13) where represents the a posteriori probability of the th mixture component of state , given . Comparing (11) and (13), it is clear that the gradient expression in the Gaussian mixture case is simply a weighted sum of the gradients of each of the Gaussian components in the mixture, where the weight on each mixture component represents its a posteriori probability of generating the observed feature vector.…”

Section: Gaussian State Output Distributions (mentioning)

confidence: 99%
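The weighted-sum structure described in the statement above can be illustrated with a minimal sketch. It is not the cited paper's equation (13): it differentiates the mixture log-likelihood with respect to the feature vector only (the paper goes on to the array parameters through a Jacobian), it assumes diagonal covariances, and the function and variable names are hypothetical.

```python
import math

def gaussian_logpdf(z, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at z."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(z, mean, var)
    )

def component_logs(z, weights, means, variances):
    """Log of (mixture weight * component density) for every component."""
    return [
        math.log(w) + gaussian_logpdf(z, m, v)
        for w, m, v in zip(weights, means, variances)
    ]

def gmm_loglik(z, weights, means, variances):
    """Mixture log-likelihood, computed with log-sum-exp for stability."""
    logs = component_logs(z, weights, means, variances)
    peak = max(logs)
    return peak + math.log(sum(math.exp(l - peak) for l in logs))

def posteriors(z, weights, means, variances):
    """A posteriori probability of each mixture component given z."""
    logs = component_logs(z, weights, means, variances)
    total = gmm_loglik(z, weights, means, variances)
    return [math.exp(l - total) for l in logs]

def gmm_loglik_gradient(z, weights, means, variances):
    """Gradient of the mixture log-likelihood with respect to z.

    It is the posterior-weighted sum of the per-component gradients,
    where each component contributes -(z - mean) / var.
    """
    post = posteriors(z, weights, means, variances)
    grad = [0.0] * len(z)
    for gamma, mean, var in zip(post, means, variances):
        for d in range(len(z)):
            grad[d] += gamma * (-(z[d] - mean[d]) / var[d])
    return grad

# Illustrative two-component mixture over a 2-dimensional feature vector.
z = [0.3, -1.2]
weights = [0.6, 0.4]
means = [[0.0, 0.0], [1.0, -2.0]]
variances = [[1.0, 0.5], [2.0, 1.5]]
gradient = gmm_loglik_gradient(z, weights, means, variances)
```

Comparing `gradient` against a central finite difference of `gmm_loglik` is a quick way to check that the posterior weighting is applied correctly.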

“…The full derivation of the Jacobian matrix for log mel spectral or cepstral features can be found in [11].…”

Section: Gaussian State Output Distributions (mentioning)

confidence: 99%

Subband Likelihood-Maximizing Beamforming for Speech Recognition in Reverberant Environments

Seltzer

Stern

2006

IEEE Trans. Audio Speech Lang. Process.


Abstract: Speech recognition performance degrades significantly in distant-talking environments, where the speech signals can be severely distorted by additive noise and reverberation. In such environments, the use of microphone arrays has been proposed as a means of improving the quality of captured speech signals. Currently, microphone-array-based speech recognition is performed in two independent stages: array processing and then recognition. Array processing algorithms, designed for signal enhancement, are applied in order to reduce the distortion in the speech waveform prior to feature extraction and recognition. This approach assumes that improving the quality of the speech waveform will necessarily result in improved recognition performance and ignores the manner in which speech recognition systems operate. In this paper, a new approach to microphone-array processing is proposed in which the goal of the array processing is not to generate an enhanced output waveform but rather to generate a sequence of features which maximizes the likelihood of generating the correct hypothesis. In this approach, called likelihood-maximizing beamforming, information from the speech recognition system itself is used to optimize a filter-and-sum beamformer. Speech recognition experiments performed in a real distant-talking environment confirm the efficacy of the proposed approach.
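For readers unfamiliar with the filter-and-sum beamformer named in the abstract, the following is a minimal time-domain sketch. Each microphone channel is passed through its own FIR filter and the filtered channels are summed. The function name, the list-based signal representation, and the example filters are assumptions made for illustration; the likelihood-based optimization of the filter taps described in the abstract is not shown.

```python
def filter_and_sum(channels, filters):
    """Time-domain filter-and-sum beamformer.

    Computes y[n] = sum over m, p of filters[m][p] * channels[m][n - p],
    treating samples before the start of each channel as zero.
    """
    if len(channels) != len(filters):
        raise ValueError("one filter is required per channel")
    n_samples = len(channels[0])
    if any(len(channel) != n_samples for channel in channels):
        raise ValueError("all channels must have the same length")

    output = [0.0] * n_samples
    for samples, taps in zip(channels, filters):
        for n in range(n_samples):
            acc = 0.0
            for p, tap in enumerate(taps):
                if n - p >= 0:
                    acc += tap * samples[n - p]
            output[n] += acc
    return output

# Delay-and-sum as a special case: the second microphone hears the source
# one sample later, so the first channel is delayed by one sample before
# the two channels are averaged.
mic0 = [1.0, 2.0, 3.0, 4.0]
mic1 = [0.0, 1.0, 2.0, 3.0]
aligned = filter_and_sum([mic0, mic1], [[0.0, 0.5], [0.5]])
```

With single-tap or pure-delay filters this reduces to delay-and-sum; longer filters give the array the extra degrees of freedom that a likelihood-based criterion can then tune.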


“…The beamforming algorithms presented in this paper have been studied in great detail and have been found to be effective on varied data [8]. Nevertheless, they can only be considered preliminary: they are computationally expensive and, in the case of speaker separation, make the rather serious assumption that the word sequences uttered by the speakers are known.…”

Section: Discussion (mentioning)

confidence: 99%

Speech Recognizer Based Maximum Likelihood Beamforming

Raj

Seltzer

Reyes-Gomez

Speech Separation by Humans and Machines

Self Cite


Abstract: In this paper we present a speech-recognizer-based maximum-likelihood beamforming technique that can be used both for signal enhancement and speaker separation. The presented technique uses an HMM-based speech recognizer as a statistical model for the target signal to be enhanced or separated. The parameters of a filter-and-sum array processor are estimated to maximize the likelihood of the output as measured using the speech recognizer. The filter-and-sum operation may be performed either in the time domain or the frequency domain. When used for speaker separation, the beamforming must be performed individually for each of the speakers. Since the competing signal is also in-domain speech in this case, the statistical model used for the beamforming is now a factorial HMM formed from the HMM for the target and that for the competing speaker(s).
