Evaluating Audiovisual Source Separation in the Context of Video Conferencing

Inan, Berkay; Cerňak, Miloš; Gräbner, Helmut; Tukuljac, Helena Peić; Pena, Rodrigo C. G.; Ricaud, Benjamin

doi:10.21437/interspeech.2019-2671

Cited by 7 publications

(16 citation statements)

References 9 publications


“…Estimators of speech intelligibility:
SII [110] | 1997 | Used for additive stationary noise or bandwidth reduction | [108]
CSII [130] | 2004 | Extension of SII for broadband peak-clipping and center-clipping distortion | [108]
ESII [210] | 2005 | Extension of SII for fluctuating noise | [108]
STOI [241] | 2011 | Able to predict quite accurately speech intelligibility in several situations | [7], [37], [55], [77], [85], [108], [109], [99], [122], [128], [136]
HASPI [132] | 2014 | Specifically designed for hearing-impaired listeners | [99], [100]
ESTOI [124] | 2016 | Extension of STOI for highly … | [107], [108], [176], [178], [179], [244]
… optimally performed, because floor or ceiling effects might occur if the listeners' task is too hard or too easy. This issue can be mitigated by testing the system at several SNR within a pre-determined range, at the expense of the time needed to conduct the listening experiments.…”

Section: Invariant (mentioning)

confidence: 99%

“…This means that in order to have good performance in a wide variety of settings, very large AV datasets for training and testing need to be collected. In practice, the systems are trained using a large number of complex acoustic
[66], [76], [77], [85], [99], [122], [128], [164], [165], [176], [178], [179], [220]-[222], [244], [263], [274], [
[17], [65], [154], [164], [165]
Landmark-based features | [100], [154], [183], [203]
Multisensory features | [195]
Face recognition embedding | [55], [109], [169], [192], [239]
VSR embedding | [7], [10], [107]-[109], [153], [222], [273]
Facial appearance embedding | [42], [208]
Compressed mouth frames | [37]
Speaker direction | [85], [244], [279]
Acoustic Features…”

Section: Audio-visual Speech Enhancement and Separation Systems (mentioning)

confidence: 99%

“…Magnitude spectrogram | [3]-[7], [10], [12], [17], [37], [42], [65], [66], [76], [77], [85], [99], [100], [107], [122], [128], [136], [153], [154], [164], [165], [176], [178], [179], [183], [192], [195], [203], [208], [220]-[222], [244], [263], [274], [279]
Phase a | [7], [10], [153]
Complex spectrogram | [55], [107], [109], [169], [239]
Raw waveform | [108], [273]
Speaker embeddings | [10], [85], [169], [192], [ [6],…”

Section: Audio-visual Speech Enhancement and Separation Systems (mentioning)

confidence: 99%

“…The features extracted from such a network provided an AV representation that allows to achieve superior performance compared to an AO-SE approach. Besides multisensory features, embeddings extracted with face recognition [55] or VSR [7] models have been shown to be effective. İnan et al. [109] performed a study to evaluate the differences between these two kinds of embeddings. Their results showed that VSR embeddings were able to separate voice activity and silence regions better than face recognition embeddings, which could provide a better distinction between speakers instead.…”

Section: A Visual Features (mentioning)

confidence: 99%

“…Recent works have used as acoustic input to the AV system either the magnitude spectrogram and the respective phase [7], [10], [153], the real and the imaginary parts of the complex spectrogram [55], [107], [109], [169], [239], or directly the raw waveform [108], [273]. Although these approaches allow to incorporate and process the full information of an acoustic signal, research in this area is still active and suggests that there is still room for improvement by exploiting the full information of the noisy speech signal [168], [281].…”

Section: B Acoustic Features (mentioning)

confidence: 99%
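
As a rough illustration of the three kinds of acoustic input named in the statement above (magnitude plus phase, real plus imaginary parts of the complex spectrogram, and the raw waveform), the following minimal sketch derives each from a single toy frame. It is not taken from any of the cited systems: the function names `dft_frame` and `acoustic_inputs`, the 8-sample frame, and the use of a naive DFT in place of a windowed STFT are all assumptions made for brevity.

```python
import cmath
import math


def dft_frame(frame):
    """Naive DFT of one real-valued frame, keeping the non-negative frequency bins."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def acoustic_inputs(frame):
    """Return the alternative input representations for one frame (hypothetical helper)."""
    spec = dft_frame(frame)
    return {
        # Option 1: magnitude spectrogram together with its phase
        "magnitude": [abs(c) for c in spec],
        "phase": [cmath.phase(c) for c in spec],
        # Option 2: real and imaginary parts of the complex spectrogram
        "real": [c.real for c in spec],
        "imag": [c.imag for c in spec],
        # Option 3: the raw waveform, left untouched
        "waveform": list(frame),
    }


# Toy frame (assumption): 8 samples of a sinusoid with 2 cycles per frame
frame = [math.sin(2 * math.pi * 2 * t / 8) for t in range(8)]
feats = acoustic_inputs(frame)
```

Options 1 and 2 carry the same information: `cmath.rect(magnitude, phase)` recovers the complex value `real + 1j * imag` for every bin, so the choice between them is about which form a model finds easier to learn from, not about what is available to it.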


Deep-learning-based audio-visual speech enhancement in presence of Lombard effect

Michelsanti

Sigurðsson

Jensen

2019

Speech Communication


Keywords: Lombard effect; audio-visual speech enhancement; deep learning; speech quality; speech intelligibility

Abstract: When speaking in presence of background noise, humans reflexively change their way of speaking in order to improve the intelligibility of their speech. This reflex is known as Lombard effect. Collecting speech in Lombard conditions is usually hard and costly. For this reason, speech enhancement systems are generally trained and evaluated on speech recorded in quiet to which noise is artificially added. Since these systems are often used in situations where Lombard speech occurs, in this work we perform an analysis of the impact that Lombard effect has on audio, visual and audio-visual speech enhancement, focusing on deep-learning-based systems, since they represent the current state of the art in the field. We conduct several experiments using an audio-visual Lombard speech corpus consisting of utterances spoken by 54 different talkers. The results show that training deep-learning-based models with Lombard speech is beneficial in terms of both estimated speech quality and estimated speech intelligibility at low signal to noise ratios, where the visual modality can play an important role in acoustically challenging situations. We also find that a performance difference between genders exists due to the distinct Lombard speech exhibited by males and females, and we analyse it in relation with acoustic and visual features. Furthermore, listening tests conducted with audio-visual stimuli show that the speech quality of the signals processed with systems trained using Lombard speech is statistically significantly better than the one obtained using systems trained with non-Lombard speech at a signal to noise ratio of −5 dB. Regarding speech intelligibility, we find a general tendency of the benefit in training the systems with Lombard speech.
