Interspeech 2019
DOI: 10.21437/interspeech.2019-2671

Evaluating Audiovisual Source Separation in the Context of Video Conferencing

Abstract: Source separation involving mono-channel audio is a challenging problem, in particular for speech separation, where source contributions overlap in both time and frequency. This task is of high interest for applications such as video conferencing. Recent progress in machine learning has shown that incorporating visual cues from the video can improve source separation performance. Starting from a recently designed deep neural network, we assess its ability and robustness to separate the visibl…

Cited by 7 publications (16 citation statements)
References 9 publications
“…Estimators of speech intelligibility:
SII [110] (1997): Used for additive stationary noise or bandwidth reduction. [108]
CSII [130] (2004): Extension of SII for broadband peak-clipping and center-clipping distortion. [108]
ESII [210] (2005): Extension of SII for fluctuating noise. [108]
STOI [241] (2011): Able to predict quite accurately speech intelligibility in several situations. [7], [37], [55], [77], [85], [108], [109], [99], [122], [128], [136]
HASPI [132] (2014): Specifically designed for hearing-impaired listeners. [99], [100]
ESTOI [124] (2016): Extension of STOI for highly … [107], [108], [176], [178], [179], [244]
… optimally performed, because floor or ceiling effects might occur if the listeners' task is too hard or too easy. This issue can be mitigated by testing the system at several SNR within a pre-determined range, at the expense of the time needed to conduct the listening experiments.…”
Section: Invariant (mentioning)
confidence: 99%
“…This means that in order to have good performance in a wide variety of settings, very large AV datasets for training and testing need to be collected. In practice, the systems are trained using a large number of complex acoustic …
[66], [76], [77], [85], [99], [122], [128], [164], [165], [176], [178], [179], [220]- [222], [244], [263], [274], [ …
[17], [65], [154], [164], [165]
Landmark-based features: [100], [154], [183], [203]
Multisensory features: [195]
Face recognition embedding: [55], [109], [169], [192], [239]
VSR embedding: [7], [10], [107]- [109], [153], [222], [273]
Facial appearance embedding: [42], [208]
Compressed mouth frames: [37]
Speaker direction: [85], [244], [279]
Acoustic Features…”
Section: Audio-visual Speech Enhancement and Separation Systems (mentioning)
confidence: 99%