Interspeech 2020
DOI: 10.21437/interspeech.2020-1065
FaceFilter: Audio-Visual Speech Separation Using Still Images

Cited by 51 publications (45 citation statements)
References 30 publications
“…Nevertheless, the datasets collected in controlled environments are a good choice for training a prototype designed for a specific purpose or for studying a particular problem. Examples of databases useful in this sense are: TCD-TIMIT [89] and OuluVS2 [16], to study the influence of several angles of view; MODALITY [46] and OuluVS2 [16], to determine the effect of different video frame rates; Lombard GRID [13], to understand the impact of the Lombard effect, also from several angles of view; RAVDESS [161], to perform a study of emotions in the context of SE and SS; KinectDigits [224] and MODALITY [46], to determine the importance that supplementary information from the depth modality might have; ASPIRE [77], to evaluate the systems in real noisy environments.…”
Section: Audio-visual Corpora (mentioning)
confidence: 99%
“…Estimators of speech quality based on energy ratios:
SNR (Signal-to-Noise Ratio): it does not provide a proper estimation of speech distortion [12], [65], [66], [109]
SSNR / SSNRI (Segmental SNR / SSNR Improvement): assessment of short-time behaviour [100], [108], [239]
SDI [31] (2006): it provides a rough distortion measure [99], [100]
SDR [252] (2006): specifically designed for blind audio source separation [7], [10], [17], [42], [55], [65], [85], [107]-[109], [136], [153], [154], [164], [165], [169], [183], [192], [195], [203], [208], [220]-[222]
SIR [252] (2006): specifically designed for blind audio source separation [7], [65], [107], [136], [164], [165], [195]
SAR [252] (2006): specifically designed for blind audio source separation [65], [107], [136], [164], [165], [195]
SI-SDR [150]…”
Section: IP Transmission (mentioning)
confidence: 99%
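The excerpt's list of energy-ratio metrics ends on SI-SDR [150], which is widely used to score separation systems like FaceFilter. As a point of reference, here is a minimal NumPy sketch of the usual SI-SDR definition; the function name, epsilon value, and toy signals are illustrative assumptions, not code from the cited survey or from FaceFilter itself.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio, in dB (sketch of [150])."""
    # Remove DC so the energy ratio is not biased by offsets.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Optimal scaling of the reference: project the estimate onto the target.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = alpha * target          # scaled target component
    e_noise = estimate - s_target      # everything else counts as distortion
    return float(10 * np.log10((np.dot(s_target, s_target) + eps)
                               / (np.dot(e_noise, e_noise) + eps)))

# Toy check: a lightly perturbed copy of the target scores high, and the
# score is unchanged if the estimate is rescaled (hence "scale-invariant").
rng = np.random.default_rng(0)
clean = rng.standard_normal(16_000)
est = clean + 0.1 * rng.standard_normal(16_000)
print(round(si_sdr(est, clean), 2), round(si_sdr(3.0 * est, clean), 2))
```

The scale invariance is the design point: because the reference is rescaled by the projection before the ratio is taken, simply amplifying the output cannot inflate the score, which is why SI-SDR displaced plain SNR/SDR for blind separation benchmarks.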
“…AVS [25]: V (face) + A, 5.9
AV-BLSTM [11,25]: V (face), 3.25
FaceFilter [26]: I (face), 2.5
AV-U-Net [27]: V (face), 7.6
AV-LSTM [12]: V …
MuSE is different from the 'Looking to listen at the cocktail party' [11]. As MuSE uses speech-lip synchronization information instead of the speech-face synchronization cue, MuSE is expected to generalize well to new speakers.…”
Section: Model (mentioning)
confidence: 99%
“…Several prior works for speaker extraction have studied various cues about the target speaker, such as voiceprint [11,20,21], lip movement [12,22], facial appearance [23], and spatial information [13].…”
Section: Relation To Prior Work (mentioning)
confidence: 99%