Exploiting correlogram structure for robust speech recognition with multiple speech sources

Ma, Ning; Green, Phil; Barker, Jon; Coy, André

doi:10.1016/j.specom.2007.05.003

Cited by 41 publications

(25 citation statements)

References 26 publications

(23 reference statements)


“…Both quantities are computed using (28) and (29) given the current estimate of the noise model M_n.…”

Section: Noise Model Estimation (mentioning)

confidence: 99%

“…Unlike the above mentioned marginalisation method, the SFD technique carries out both mask estimation and speech recognition at the same time by searching for the optimal segregation mask and HMM state sequence given a set of time-frequency fragments identified prior to the decoding stage. These fragments correspond to patches in the noisy spectrum that are dominated by the energy of an acoustic source [28]. Thus, the SFD approach determines the most likely set of speech fragments among all the possible combinations of source fragments by exploiting knowledge of the speech source provided by the speech models in the recogniser.…”

Section: Comparison With Other Missing-data Techniques (mentioning)

confidence: 99%


Spectral Reconstruction and Noise Model Estimation Based on a Masking Model for Noise Robust Speech Recognition

González, Gómez, Peinado, et al. 2017

Circuits Syst Signal Process

Self Cite


An effective way to increase noise robustness in automatic speech recognition (ASR) systems is feature enhancement based on an analytical distortion model that describes the effects of noise on the speech features. One such distortion model, which has been reported to achieve a good tradeoff between accuracy and simplicity, is the masking model. Under this model, speech distortion caused by environmental noise is seen as a spectral mask and, as a result, noisy speech features can be either reliable (speech is not masked by noise) or unreliable (speech is masked). In this paper we present a detailed overview of this model and its applications to noise-robust ASR. Firstly, using the masking model, we derive a spectral reconstruction technique aimed at enhancing the noisy speech features. Two problems must be solved in order to perform spectral reconstruction using the masking model: i) mask estimation, i.e. determining the reliability of the noisy features, and ii) feature imputation, i.e. estimating speech for the unreliable features. Unlike missing-data imputation techniques where the two problems are considered as independent, our technique jointly addresses them by exploiting a priori knowledge of the speech and noise sources in the form of a statistical model. Secondly, we propose an algorithm for estimating the noise model required by the feature enhancement technique. The proposed algorithm fits a Gaussian mixture model (GMM) to the noise by iteratively maximising the likelihood of the noisy speech signal so that noise can be estimated even during speech-dominating frames. A comprehensive set of experiments carried out on the Aurora-2 and Aurora-4 databases shows that the proposed method achieves significant improvements over the baseline system and other similar missing-data imputation techniques.
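The reliable/unreliable split described in this abstract can be illustrated with a toy example. This is only a sketch under stated assumptions, not the authors' technique: it assumes the common log-max form of the masking model (each noisy log-spectral feature is the larger of the speech and noise log-energies), it builds an oracle mask from clean speech and noise that would not be available in practice, and it replaces the paper's joint statistical estimation with a crude bounded fill-in. All function names and numeric values are hypothetical.

```python
def mix_log_max(speech_log, noise_log):
    """Assumed log-max masking model: the noisy feature in each channel
    is the larger of the speech and noise log-energies."""
    return [max(x, n) for x, n in zip(speech_log, noise_log)]


def oracle_mask(speech_log, noise_log):
    """True where speech dominates (reliable), False where noise masks it.
    Uses the clean signals, so it is an oracle, not an estimator."""
    return [x > n for x, n in zip(speech_log, noise_log)]


def bounded_estimate(noisy_log, mask, prior_mean):
    """Toy imputation: keep reliable features as observed; for unreliable
    ones use a prior mean, capped by the observation because masked
    speech cannot exceed the observed energy under the log-max model."""
    return [y if m else min(mu, y)
            for y, m, mu in zip(noisy_log, mask, prior_mean)]


# Hypothetical log-energies for four frequency channels of one frame.
speech = [10.0, 4.0, 7.5, 2.0]
noise = [6.0, 6.0, 6.0, 6.0]

noisy = mix_log_max(speech, noise)
mask = oracle_mask(speech, noise)
restored = bounded_estimate(noisy, mask, prior_mean=[8.0, 5.0, 8.0, 7.0])
```

In the reliable channels the observation is passed through unchanged, while in the masked channels the observation only acts as an upper bound on the speech energy, which is the intuition behind treating mask estimation and imputation together.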


“…A fragment decoding system then attempts to interpret the high-energy regions that are not accounted for by the noise floor model. The first step is to separately generate soft missing data masks (using the adaptive noise tracker) and fragments (using harmonicity-based techniques [36]) from the noisy signals.…”

Section: Combining SFD and Noise Floor Modeling (mentioning)

confidence: 99%

“…This work employs techniques for tracking multiple pitches of simultaneous sounds in the autocorrelogram domain and use this information to identify fragments [36]. In brief, a running short-time autocorrelation is computed on the output of each gammatone filter using a 30-ms Hann window.…”

Section: B. Fragment Generation (mentioning)

confidence: 99%
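The running short-time autocorrelation mentioned in the statement above can be sketched for a single channel. This is a minimal illustration, not the cited system: the gammatone filterbank is omitted and a synthetic sine tone stands in for one filter output, and the sample rate, test pitch, lag range, and function names are all assumptions made for the example. Only the 30-ms Hann window comes from the quoted text.

```python
import math


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def short_time_autocorr(frame, max_lag):
    """Autocorrelation of one Hann-windowed frame for lags 0..max_lag."""
    window = hann(len(frame))
    x = [s * w for s, w in zip(frame, window)]
    return [sum(x[t] * x[t + lag] for t in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def peak_lag(acf, min_lag):
    """Lag of the largest autocorrelation value at or above min_lag."""
    return max(range(min_lag, len(acf)), key=lambda k: acf[k])


fs = 8000                         # assumed sample rate in Hz
f0 = 200.0                        # hypothetical pitch of the stand-in tone
frame_len = round(0.030 * fs)     # 30-ms analysis window in samples

# Stand-in for the output of one gammatone channel.
frame = [math.sin(2.0 * math.pi * f0 * t / fs) for t in range(frame_len)]

acf = short_time_autocorr(frame, max_lag=80)
lag = peak_lag(acf, min_lag=20)   # skip the trivial peak at lag 0
pitch_estimate = fs / lag         # should land close to f0
```

Stacking such autocorrelation functions across all filter channels for one frame gives one slice of a correlogram; channels dominated by the same periodic source share a peak at the same lag, which is the cue a pitch-based fragment generator can group on.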

Combining Speech Fragment Decoding and Adaptive Noise Floor Modeling

Barker, Christensen, et al. 2012

IEEE Trans. Audio Speech Lang. Process.

Self Cite


Abstract: This paper presents a novel noise-robust automatic speech recognition (ASR) system that combines aspects of the noise modeling and source separation approaches to the problem. The combined approach has been motivated by the observation that the noise backgrounds encountered in everyday listening situations can be roughly characterized as a slowly varying noise floor in which there are embedded a mixture of energetic but unpredictable acoustic events. Our solution combines two complementary techniques. First, an adaptive noise floor model estimates the degree to which high-energy acoustic events are masked by the noise floor (represented by a soft missing data mask). Second, a fragment decoding system attempts to interpret the high-energy regions that are not accounted for by the noise floor model. This component uses models of the target speech to decide whether fragments should be included in the target speech stream or not. Our experiments on the CHiME corpus task show that the combined approach performs significantly better than systems using either the noise model or fragment decoding approach alone, and substantially outperforms multicondition training.

Index Terms: Adaptive noise floor modeling, fragment decoding, missing data decoding, noise robust speech recognition.


“…Excitation features, such as voicing and fundamental frequency, are used in many speech processing applications and include, for example, speech coding, enhancement, noise estimation, automatic speech recognition in noisy conditions and tonal language speech recognition (Kaewtip et al, 2013; Kawahara et al, 2001; Lei et al, 2006; Ma et al, 2007; McAulay and Champion, 1990; Morales-Cordovilla et al, 2011a,b). Similarly, spectral envelope and formant features are used in a range of applications such as speech coding, synthesis, recognition and voice conversion (Hermansky, 1990; Kawahara et al, 2001, 2009; Koriyama et al, 2014).…”

Section: Introduction (mentioning)

confidence: 99%

Estimating acoustic speech features in low signal-to-noise ratios using a statistical framework

Harding, Milner 2017

Computer Speech & Language
