Deep-learning-based audio-visual speech enhancement in presence of Lombard effect

Michelsanti, Daniel; Sigurðsson, Sigurður; Jensen, Jesper

doi:10.1016/j.specom.2019.10.006

Cited by 32 publications

(55 citation statements)

References 250 publications

“…As seen from this discussion, not only the experimental setup was different compared to our approach, but the analysis also differs from that performed by us; thus a direct comparison is not possible. Even the observation of what SNR value the model does not work at seems to be uncommon; in the case of our research, the models stop working at a threshold of -5 dB, in the work of Michelsanti et al [57], it refers to 5 dB.…”

Section: B Results Analysis (mentioning)

confidence: 51%

“…The ESTOI values changed dramatically from 0.442 for -20 dB to -5 dB SNR, up to 0.927 for a SNR range between 10 and 30 dB. So, the relative performance of the systems at SNR ≤ 5 dB is similar to that observed for the systems trained on a narrow SNR range [57].…”

Section: B Results Analysis (mentioning)

confidence: 51%

“…When comparing the obtained results to the state-of-the-art, one can see that such a comparison in practice is not straightforward. For example, Michelsanti et al [57] reported averaged scores of PESQ and ESTOI (Extended Short-Time Objective Intelligibility) [58] measures for a deep-learning-based system of audio-visual speech enhancement with the Lombard effect applied. To elicit the Lombard effect, Speech Shaped Noise (SSN) at 80 dB Sound Pressure Level (SPL) was presented to the speakers while they were reading the sentences [57].…”

Section: B Results Analysis (mentioning)

confidence: 99%

“…For example, Michelsanti et al [57] reported averaged scores of PESQ and ESTOI (Extended Short-Time Objective Intelligibility) [58] measures for a deep-learning-based system of audio-visual speech enhancement with the Lombard effect applied. To elicit the Lombard effect, Speech Shaped Noise (SSN) at 80 dB Sound Pressure Level (SPL) was presented to the speakers while they were reading the sentences [57]. It is worth noting that ESTOI scores, which estimate speech intelligibility, range from 0 to 1, where high values correspond to high speech intelligibility.…”

Section: B Results Analysis (mentioning)

confidence: 99%

“…Results obtained by Michelsanti et al [57] refer to two types of subjective tests, namely the MUSHRA and speech intelligibility tests. For an AO-L at -5 dB SNR, the result was approx.…”

Section: Subjective Test Results (mentioning)

confidence: 99%

Evaluation of Lombard Speech Models in the Context of Speech in Noise Enhancement

et al. 2020

The Lombard effect is one of the most well-known effects of noise on speech production. Speech with the Lombard effect is more easily recognizable in noisy environments than normal natural speech. Our previous investigations showed that speech synthesis models might retain Lombard-effect characteristics. In this study, we investigate several speech models, such as harmonic, source-filter, and sinusoidal, applied to Lombard speech in the context of speech enhancement. For this purpose, 100 utterances of natural speech, and 100 with the Lombard effect induced are used. The goal of this study is to check to what extent speech utterances based on these models are recognizable and at what SNR (Signal-to-Noise Ratio) level threshold a particular model stops working. For this purpose, the synthesized models and Lombard speech are mixed with babble speech and street noise recordings with different SNRs. The quality of these models is measured, employing objective indicators as well as subjective tests. Since there is no standardized measure to apply to enhanced speech, an objective measure of assessing the speech quality of a model synthesizing Lombard speech characteristics, based on a feature vector, is proposed. Our approach is then compared with the standardized metric used in telecommunications as well as with subjective test results. The experimental investigations show the superiority of the source-filter models applied to synthesize Lombard speech over other models utilized. Also, the measure proposed correlates more closely with the results of the subjective evaluation than the outcomes from the ITU-T P.563 recommendation. This was checked with an ANOVA statistical analysis.

Speech enhancement with noise estimation and filtration using deep learning models

Kantamaneni, Charles, Babu 2023

Theoretical Computer Science

Wase: Learning When to Attend for Speaker Extraction in Cocktail Party Environments

Hao, Zhang, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In the speaker extraction problem, it is found that additional information from the target speaker contributes to the tracking and extraction of the target speaker, which includes voiceprint, lip movement, facial expression, and spatial information. However, no one cares for the cue of sound onset, which has been emphasized in the auditory scene analysis and psychology. Inspired by it, we explicitly modeled the onset cue and verified the effectiveness in the speaker extraction task. We further extended to the onset/offset cues and got performance improvement. From the perspective of tasks, our onset/offset-based model completes the composite task, a complementary combination of speaker extraction and speaker-dependent voice activity detection. We also combined voiceprint with onset/offset cues. Voiceprint models voice characteristics of the target while onset/offset models the start/end information of the speech. From the perspective of auditory scene analysis, the combination of two perception cues can promote the integrity of the auditory object. The experiment results are also close to state-of-the-art performance, using nearly half of the parameters. We hope that this work will inspire communities of speech processing and psychology, and contribute to communication between them. Our code will be available in https://github.com/aispeech-lab/wase/.
