Unsupervised Equalization of Lombard Effect for Speech Recognition in Noisy Adverse Environments

Bořil, Hynek; Hansen, John H. L.

doi:10.1109/tasl.2009.2034770

Cited by 70 publications

(44 citation statements)

References 49 publications


“…Relevant examples include level and/or frequency equalization techniques which attempt to transform the speech-plus-noise signal so that its features mimic a set of reference features calculated in the temporal, spectral, or cepstral domains. One line of work equalizes the noisy input signal to reflect the characteristics of the clean speech used to train the ASR system (Hilger and Ney, 2006; Joshi et al., 2011), while other work has focused on undoing the characteristics of Lombard speech (Boril and Hansen, 2010). Other techniques involve more complex models of intelligibility with the explicit goal of enhancing some intelligibility metric and may operate on clean speech prior to the addition of noise (Chanda and Park, 2007).…”

Section: B Comparison With Other Methods (mentioning)

confidence: 99%

Masking release for hearing-impaired listeners: The effect of increased audibility through reduction of amplitude variability

Desloge, Reed, Braida, et al. 2017

The Journal of the Acoustical Society of America


The masking release (i.e., better speech recognition in fluctuating compared to continuous noise backgrounds) observed for normal-hearing (NH) listeners is generally reduced or absent in hearing-impaired (HI) listeners. One explanation for this lies in the effects of reduced audibility: elevated thresholds may prevent HI listeners from taking advantage of signals available to NH listeners during the dips of temporally fluctuating noise where the interference is relatively weak. This hypothesis was addressed through the development of a signal-processing technique designed to increase the audibility of speech during dips in interrupted noise. This technique acts to (i) compare short-term and long-term estimates of energy, (ii) increase the level of short-term segments whose energy is below the average energy, and (iii) normalize the overall energy of the processed signal to be equivalent to that of the original long-term estimate. Evaluations of this energy-equalizing (EEQ) technique included consonant identification and sentence reception in backgrounds of continuous and regularly interrupted noise. For HI listeners, performance was generally similar for processed and unprocessed signals in continuous noise; however, superior performance for EEQ processing was observed in certain regularly interrupted noise backgrounds.
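The three EEQ steps listed in the abstract can be illustrated with a minimal frame-based sketch. This is not the authors' implementation: the non-overlapping framing, the `frame_len` and `max_gain_db` parameters, and the function name `eeq_sketch` are all assumptions made for illustration, since the abstract does not specify how the short-term and long-term energy estimates are computed.

```python
import numpy as np

def eeq_sketch(x, frame_len=80, max_gain_db=20.0):
    """Illustrative energy equalization (assumed framing and gain cap)."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    if n_frames == 0:
        return x.copy()
    usable = n_frames * frame_len
    frames = x[:usable].reshape(n_frames, frame_len)

    # (i) Short-term energy per frame vs. long-term (average) energy.
    short_e = np.mean(frames ** 2, axis=1)
    long_e = np.mean(x[:usable] ** 2)

    # (ii) Raise the level of frames whose energy is below the average.
    eps = 1e-12  # avoids division by zero in silent frames
    gain = np.ones(n_frames)
    low = short_e < long_e
    gain[low] = np.sqrt(long_e / (short_e[low] + eps))
    gain = np.minimum(gain, 10.0 ** (max_gain_db / 20.0))  # assumed cap

    y = x.copy()
    y[:usable] = (frames * gain[:, None]).reshape(-1)

    # (iii) Rescale so the overall energy matches the original signal.
    out_e = np.mean(y ** 2)
    if out_e > 0:
        y *= np.sqrt(np.mean(x ** 2) / out_e)
    return y

# Usage: a loud segment followed by a quiet one.
x = np.concatenate([np.ones(800), 0.1 * np.ones(800)])
y = eeq_sketch(x)
```

In this toy input the quiet half is boosted relative to the loud half while the total energy is unchanged, which is the reduction of amplitude variability the title refers to.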


“…Previous studies on robust ASR for stressed speech and Lombard effect speech have reported performance gains when altering configurations of the front-end feature extraction filter banks [26][27][28]. Inspired by [28], our first step is to replace the Mel and Bark filter banks (FB) in MFCC and PLP by a bank of triangular and rectangular filters uniformly distributed over a linear frequency axis.…”

Section: Modified Front-ends (mentioning)

confidence: 99%

“…Inspired by [28], our first step is to replace the Mel and Bark filter banks (FB) in MFCC and PLP by a bank of triangular and rectangular filters uniformly distributed over a linear frequency axis. In the case of the triangular bank, the band cutoffs are located at the center frequencies of the adjacent filters while the rectangular filters are stacked next to each other without overlap as in [27].…”

Section: Modified Front-ends (mentioning)

confidence: 99%
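The two filter bank layouts described in the statements above can be sketched as follows. This is a hedged illustration rather than the citing authors' front-end: the function name `linear_filter_bank` and the parameters `n_filters`, `n_fft`, and `sample_rate` are assumptions, and no per-filter energy normalization is applied.

```python
import numpy as np

def linear_filter_bank(n_filters, n_fft, sample_rate, shape="triangular"):
    """Filters spaced uniformly on a linear frequency axis (illustrative)."""
    nyquist = sample_rate / 2.0
    freqs = np.linspace(0.0, nyquist, n_fft // 2 + 1)  # FFT bin frequencies
    bank = np.zeros((n_filters, freqs.size))

    if shape == "triangular":
        # n_filters + 2 equally spaced points: filter i peaks at pts[i + 1]
        # and its cutoffs sit at the centers of the adjacent filters.
        pts = np.linspace(0.0, nyquist, n_filters + 2)
        for i in range(n_filters):
            lo, mid, hi = pts[i], pts[i + 1], pts[i + 2]
            rising = (freqs - lo) / (mid - lo)
            falling = (hi - freqs) / (hi - mid)
            bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    elif shape == "rectangular":
        # Bands stacked side by side: each bin belongs to exactly one filter.
        edges = np.linspace(0.0, nyquist, n_filters + 1)
        idx = np.searchsorted(edges, freqs, side="right") - 1
        idx = np.clip(idx, 0, n_filters - 1)
        bank[idx, np.arange(freqs.size)] = 1.0
    else:
        raise ValueError("shape must be 'triangular' or 'rectangular'")
    return bank

# Usage: three filters over 0-8 kHz with a 64-point FFT.
tri = linear_filter_bank(3, 64, 16000, "triangular")
rect = linear_filter_bank(3, 64, 16000, "rectangular")
```

Multiplying either matrix by a power spectrum of length `n_fft // 2 + 1` gives the per-band energies that would replace the Mel or Bark band energies in an MFCC or PLP pipeline.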

UT-Vocal Effort II: Analysis and constrained-lexicon recognition of whispered speech

Ghaffarzadegan, Bořil, Hansen 2014

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


This study focuses on acoustic variations in speech introduced by whispering, and proposes several strategies to improve robustness of automatic speech recognition of whispered speech with neutral-trained acoustic models. In the analysis part, differences in neutral and whispered speech captured in the UT-Vocal Effort II corpus are studied in terms of energy, spectral slope, and formant center frequency and bandwidth distributions in silence, voiced, and unvoiced speech signal segments. In the part dedicated to speech recognition, several strategies involving front-end filter bank redistribution, cepstral dimensionality reduction, and lexicon expansion for alternative pronunciations are proposed. The proposed neutral-trained system employing redistributed filter bank and reduced features provides a 7.7 % absolute WER reduction over the baseline system trained on neutral speech, and a 1.3 % reduction over a baseline system with whisper-adapted acoustic models.


“…The fundamental frequency of speech (F0) is known to be affected by stress [19,20], emotions [19,21], and talking styles [22]. Different languages may exhibit unique F0 characteristics [23] and the same may be observed also for individual dialects of a language [24].…”

Section: Fundamental Frequency Analysis (mentioning)

confidence: 99%

Engineering Analysis and Recognition of Nigerian English: An Insight into Low Resource Languages

Amuda, Bořil, Sangwan, et al. 2014

TMLAI

Self Cite


A comparative analysis between Nigerian English (NE) and American English (AE) is presented in this article. The study is aimed at highlighting differences in the speech parameters, and how they influence speech processing and automatic speech recognition (ASR). The UILSpeech corpus of Nigerian-Accented English isolated word recordings, read speech utterances, and video recordings are used as a reference for Nigerian English. The corpus captures the linguistic diversity of Nigeria with data collected from native speakers of Hausa, Igbo, and Yoruba languages. The UILSpeech corpus is intended to provide a unique opportunity for application and expansion of speech processing techniques to a limited resource language dialect. The acoustic-phonetic differences between American English (AE) and Nigerian English (NE) are studied in terms of pronunciation variations, vowel locations in the formant space, mean fundamental frequency, and phone model distances in the acoustic space, as well as through visual speech analysis of the speakers' articulators. A strong impact of the AE-NE acoustic mismatch on ASR is observed. A combination of model adaptation and extension of the AE lexicon for newly established NE pronunciation variants is shown to substantially improve performance of the AE-trained ASR system in the new NE task. This study is a part of the pioneering efforts towards incorporating speech technology in Nigerian English and is intended to provide a development basis for other low resource language dialects and languages.
