Analysis of the visual Lombard effect and automatic recognition experiments

Heracleous, Panikos; Ishi, Carlos Toshinori; Sato, Miki; Ishiguro, Hiroshi; Hagita, Norihiro

doi:10.1016/j.csl.2012.06.003

Cited by 14 publications

(9 citation statements)

References 10 publications


“…The training and the evaluation of the systems are usually performed with speech recorded in quiet and afterwards degraded with additive noise. Previous work shows that speaker (Hansen and Varadarajan, 2009) and speech recognition (Junqua, 1993) systems that ignore Lombard effect achieve sub-optimal performance, also in visual (Heracleous et al, 2013; Marxer et al, 2018) and audiovisual settings (Heracleous et al, 2013). It is therefore of interest to conduct a similar study also in a SE context.…”

Section: Introduction (mentioning)

confidence: 98%

Deep-learning-based audio-visual speech enhancement in presence of Lombard effect

Michelsanti

Sigurðsson

Jensen

2019

Speech Communication


Keywords: Lombard effect, audio-visual speech enhancement, deep learning, speech quality, speech intelligibility

Abstract: When speaking in presence of background noise, humans reflexively change their way of speaking in order to improve the intelligibility of their speech. This reflex is known as Lombard effect. Collecting speech in Lombard conditions is usually hard and costly. For this reason, speech enhancement systems are generally trained and evaluated on speech recorded in quiet to which noise is artificially added. Since these systems are often used in situations where Lombard speech occurs, in this work we perform an analysis of the impact that Lombard effect has on audio, visual and audio-visual speech enhancement, focusing on deep-learning-based systems, since they represent the current state of the art in the field. We conduct several experiments using an audio-visual Lombard speech corpus consisting of utterances spoken by 54 different talkers. The results show that training deep-learning-based models with Lombard speech is beneficial in terms of both estimated speech quality and estimated speech intelligibility at low signal to noise ratios, where the visual modality can play an important role in acoustically challenging situations. We also find that a performance difference between genders exists due to the distinct Lombard speech exhibited by males and females, and we analyse it in relation with acoustic and visual features. Furthermore, listening tests conducted with audio-visual stimuli show that the speech quality of the signals processed with systems trained using Lombard speech is statistically significantly better than the one obtained using systems trained with non-Lombard speech at a signal to noise ratio of −5 dB. Regarding speech intelligibility, we find a general tendency of the benefit in training the systems with Lombard speech.



“…The mismatch between the neutral and the Lombard speaking styles can lead to sub-optimal performance of audio-only-based speaker [15] and speech recognition [2] systems. Only a few works investigate the impact of the Lombard effect on visual [16,17] and audio-visual [16] automatic speech recognition, but, to the best knowledge of the authors, no studies have been conducted for AV-SE systems.…”

Section: Introduction (mentioning)

confidence: 99%

Effects of Lombard Reflex on the Performance of Deep-learning-based Audio-visual Speech Enhancement Systems

Michelsanti

Sigurðsson

Jensen

2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Humans tend to change their way of speaking when they are immersed in a noisy environment, a reflex known as Lombard effect. Current speech enhancement systems based on deep learning do not usually take into account this change in the speaking style, because they are trained with neutral (non-Lombard) speech utterances recorded under quiet conditions to which noise is artificially added. In this paper, we investigate the effects that the Lombard reflex has on the performance of audio-visual speech enhancement systems based on deep learning. The results show that a gap in the performance of as much as approximately 5 dB between the systems trained on neutral speech and the ones trained on Lombard speech exists. This indicates the benefit of taking into account the mismatch between neutral and Lombard speech in the design of audio-visual speech enhancement systems.

Index Terms: Audio-visual speech enhancement, deep learning, Lombard effect


“…Finally, we report results on sentence-level speech recognition. This is in contrast to previous works which mainly focus either on isolated words [16] or on specific words within a sentence [13]. We believe that the conclusions reached by this approach can be more useful for a practical speech recognition system where the goal will most likely be to recognise all words in a sentence rather than recognise just isolated words.…”

Section: Introduction (mentioning)

confidence: 81%

“…As expected the improvement is higher when visual Lombard speech is used for training. On the other hand, Heracleous et al [16] reported a performance drop when there is a mismatch between training and testing conditions. The same conclusion was also reached when an audio-visual speech recognition system was used.…”

Section: Introduction (mentioning)

confidence: 99%


Investigating the Lombard Effect Influence on End-to-End Audio-Visual Speech Recognition

Petridis

Pantić

2019

Interspeech 2019


Several audio-visual speech recognition models have been recently proposed which aim to improve the robustness over audio-only models in the presence of noise. However, almost all of them ignore the impact of the Lombard effect, i.e., the change in speaking style in noisy environments which aims to make speech more intelligible and affects both the acoustic characteristics of speech and the lip movements. In this paper, we investigate the impact of the Lombard effect in audio-visual speech recognition. To the best of our knowledge, this is the first work which does so using end-to-end deep architectures and presents results on unseen speakers. Our results show that properly modelling Lombard speech is always beneficial. Even if a relatively small amount of Lombard speech is added to the training set then the performance in a real scenario, where noisy Lombard speech is present, can be significantly improved. We also show that the standard approach followed in the literature, where a model is trained and tested on noisy plain speech, provides a correct estimate of the video-only performance and slightly underestimates the audio-visual performance. In case of audio-only approaches, performance is overestimated for SNRs higher than -3 dB and underestimated for lower SNRs.

