Effects of Lombard Reflex on the Performance of Deep-learning-based Audio-visual Speech Enhancement Systems

Michelsanti, Daniel; Sigurðsson, Sigurður; Jensen, Jesper

doi:10.1109/icassp.2019.8682713

Cited by 6 publications

(6 citation statements)

References 31 publications

“…With the objective of providing a more extensive analysis of the impact of Lombard effect on deep-learning-based SE systems, the present work extends a preliminary study (Michelsanti et al, 2019a), providing the following novel contributions. First, new experiments are conducted, where deep-learning-based SE systems trained with Lombard or non-Lombard speech are evaluated on Lombard speech using a cross-validation setting to avoid that a potential intraspeaker variability of the adopted dataset leads to biased conclusions.…”

Section: Introduction (mentioning)

confidence: 86%

“…In this study, we train and evaluate systems that perform spectral SE using deep learning, as illustrated in Figure 1. The processing pipeline is inspired by Gabbay et al (2018) and the same as the one used in (Michelsanti et al, 2019a).…”

Section: Methods (mentioning)

confidence: 99%

“…In order to induce the Lombard effect, speech shaped noise (SSN) at 80 dB sound pressure level (SPL) was presented to the speakers, while they were reading the sentences to a listener. The presence of a listener, who assured a natural communication environment by asking the participants to repeat the utterances from time to time, was needed, because talkers usually adjust their speech to communicate better with the people they are talking to (Lane and Tranel, 1971; Lu and Cooke, 2008), a process known as external or public loop (Lane and Tranel, 1971). [Figure 1: Pipeline of the audio-visual speech enhancement framework used in this study, adapted from (Gabbay et al, 2018), and identical to (Michelsanti et al, 2019a). The deep-learning-based system estimates an ideal amplitude mask from the video of the speaker's mouth and the magnitude spectrogram of the noisy speech.]…”

Section: Materials: Audio-visual Speech Corpus and Noise Data (mentioning)

confidence: 99%
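
As a rough illustration of the mask named in the Figure 1 caption above, the sketch below computes an ideal amplitude mask as the per-bin ratio of clean to noisy magnitude, which is the usual definition of that training target. The function names, the `eps` guard, and the clipping value of 10 are illustrative assumptions and are not taken from the cited papers.

```python
def ideal_amplitude_mask(clean_mag, noisy_mag, eps=1e-8, max_value=10.0):
    """Per-bin ratio |clean| / |noisy| for spectrograms given as lists of rows.

    eps avoids division by zero; max_value clips very large ratios.
    Both are hypothetical choices made for this sketch.
    """
    return [
        [min(c / (n + eps), max_value) for c, n in zip(clean_row, noisy_row)]
        for clean_row, noisy_row in zip(clean_mag, noisy_mag)
    ]


def apply_mask(mask, noisy_mag):
    """Multiply the noisy magnitudes by the mask to estimate the clean magnitudes."""
    return [
        [m * n for m, n in zip(mask_row, noisy_row)]
        for mask_row, noisy_row in zip(mask, noisy_mag)
    ]
```

In the pipeline described by the quote, a network would predict this mask from the mouth video and the noisy magnitude spectrogram; here it is computed directly from known clean magnitudes only to show what the target looks like.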

“…Since we would like to assess the performance of SE systems when Lombard speech occurs, SSN is added to the speech signals from the Lombard GRID corpus at 6 different SNRs, in uniform steps between −20 dB and 5 dB. This choice was driven by the following considerations (Michelsanti et al, 2019a). Since Lombard and non-Lombard utterances from the Lombard GRID corpus have an energy difference between 3 and 13 dB (Marxer et al, 2018), the actual SNR can be computed assuming that the conversational speech level is between 60 and 70 dB sound pressure level (SPL) (Raphael et al, 2007;Moore et al, 2012) and the noise level at 80 dB SPL, like in the recording conditions of the database.…”

Section: Systems Trained On a Narrow SNR Range (mentioning)

confidence: 99%
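
The SNR arithmetic in the statement above can be sketched as follows. The snippet lists the 6 uniformly spaced SNRs between −20 dB and 5 dB, scales a noise signal so that a mixture reaches a requested SNR, and estimates the SNR range implied by the quoted levels. All function names are hypothetical, and the range estimate assumes that the Lombard speech level is the conversational level plus the reported energy difference, with the SNR taken as speech level minus noise level.

```python
import math


def snr_grid(low_db=-20.0, high_db=5.0, count=6):
    """Uniformly spaced SNR values in dB, endpoints included."""
    step = (high_db - low_db) / (count - 1)
    return [low_db + i * step for i in range(count)]


def mean_power(signal):
    """Average power of a signal given as a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaling the noise so the mixture has the given SNR."""
    target_ratio = 10 ** (snr_db / 10)  # speech power / noise power
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * target_ratio))
    return [s + gain * n for s, n in zip(speech, noise)]


def actual_snr_range(speech_spl=(60, 70), lombard_gain=(3, 13), noise_spl=80):
    """Lowest and highest SNR in dB under the additive-level assumption above."""
    lowest = speech_spl[0] + lombard_gain[0] - noise_spl
    highest = speech_spl[1] + lombard_gain[1] - noise_spl
    return lowest, highest
```

Under these assumptions, `snr_grid()` yields −20, −15, −10, −5, 0 and 5 dB, and `actual_snr_range()` gives −17 dB to 3 dB, which falls inside the chosen grid.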

Deep-learning-based audio-visual speech enhancement in presence of Lombard effect

Michelsanti

Sigurðsson

Jensen

2019

Speech Communication

Self Cite

Keywords: Lombard effect, audio-visual speech enhancement, deep learning, speech quality, speech intelligibility

Abstract: When speaking in presence of background noise, humans reflexively change their way of speaking in order to improve the intelligibility of their speech. This reflex is known as Lombard effect. Collecting speech in Lombard conditions is usually hard and costly. For this reason, speech enhancement systems are generally trained and evaluated on speech recorded in quiet to which noise is artificially added. Since these systems are often used in situations where Lombard speech occurs, in this work we perform an analysis of the impact that Lombard effect has on audio, visual and audio-visual speech enhancement, focusing on deep-learning-based systems, since they represent the current state of the art in the field. We conduct several experiments using an audio-visual Lombard speech corpus consisting of utterances spoken by 54 different talkers. The results show that training deep-learning-based models with Lombard speech is beneficial in terms of both estimated speech quality and estimated speech intelligibility at low signal to noise ratios, where the visual modality can play an important role in acoustically challenging situations. We also find that a performance difference between genders exists due to the distinct Lombard speech exhibited by males and females, and we analyse it in relation with acoustic and visual features. Furthermore, listening tests conducted with audio-visual stimuli show that the speech quality of the signals processed with systems trained using Lombard speech is statistically significantly better than the one obtained using systems trained with non-Lombard speech at a signal to noise ratio of −5 dB. Regarding speech intelligibility, we find a general tendency of the benefit in training the systems with Lombard speech.

“…The same conclusion was also reached when an audio-visual speech recognition system was used. Finally, it has recently been shown that the mismatch between plain and Lombard speech can also affect the performance of audio-visual speech enhancement models [17].…”

Section: Introduction (mentioning)

confidence: 99%

Investigating the Lombard Effect Influence on End-to-End Audio-Visual Speech Recognition

Petridis

Pantić

2019

Interspeech 2019

Several audio-visual speech recognition models have been recently proposed which aim to improve the robustness over audio-only models in the presence of noise. However, almost all of them ignore the impact of the Lombard effect, i.e., the change in speaking style in noisy environments which aims to make speech more intelligible and affects both the acoustic characteristics of speech and the lip movements. In this paper, we investigate the impact of the Lombard effect in audio-visual speech recognition. To the best of our knowledge, this is the first work which does so using end-to-end deep architectures and presents results on unseen speakers. Our results show that properly modelling Lombard speech is always beneficial. Even if a relatively small amount of Lombard speech is added to the training set, the performance in a real scenario, where noisy Lombard speech is present, can be significantly improved. We also show that the standard approach followed in the literature, where a model is trained and tested on noisy plain speech, provides a correct estimate of the video-only performance and slightly underestimates the audio-visual performance. In the case of audio-only approaches, performance is overestimated for SNRs higher than −3 dB and underestimated for lower SNRs.

Redundant Convolutional Network With Attention Mechanism For Monaural Speech Enhancement

Lan

Lyu

Hui

et al.

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
