Interspeech 2020
DOI: 10.21437/interspeech.2020-1065
FaceFilter: Audio-Visual Speech Separation Using Still Images

Cited by 51 publications (45 citation statements)
References 30 publications
“…Nevertheless, the datasets collected in controlled environments are a good choice for training a prototype designed for a specific purpose or for studying a particular problem. Examples of databases useful in this sense are: TCD-TIMIT [89] and OuluVS2 [16], to study the influence of several angles of view; MODALITY [46] and OuluVS2 [16], to determine the effect of different video frame rates; Lombard GRID [13], to understand the impact of the Lombard effect, also from several angles of view; RAVDESS [161], to perform a study of emotions in the context of SE and SS; KinectDigits [224] and MODALITY [46], to determine the importance that supplementary information from the depth modality might have; ASPIRE [77], to evaluate the systems in real noisy environments.…”
Section: Audio-visual Corpora (mentioning)
confidence: 99%
“…Estimators of speech quality based on energy ratios:
SNR (Signal-to-Noise Ratio): it does not provide a proper estimation of speech distortion [12], [65], [66], [109]
SSNR / SSNRI (Segmental SNR / SSNR Improvement): assessment of short-time behaviour [100], [108], [239]
SDI [31] (2006): it provides a rough distortion measure [99], [100]
SDR [252] (2006): specifically designed for blind audio source separation [7], [10], [17], [42], [55], [65], [85], [107]-[109], [136], [153], [154], [164], [165], [169], [183], [192], [195], [203], [208], [220]-[222]
SIR [252] (2006): specifically designed for blind audio source separation [7], [65], [107], [136], [164], [165], [195]
SAR [252] (2006): specifically designed for blind audio source separation [65], [107], [136], [164], [165], [195]
SI-SDR [150]…”
Section: IP Transmission (mentioning)
confidence: 99%
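The excerpt's list of energy-ratio metrics ends on SI-SDR [150], which is widely used to score separation systems like FaceFilter. As a point of reference, here is a minimal NumPy sketch of the usual SI-SDR definition; the function name, epsilon value, and toy signals are illustrative assumptions, not code from the cited survey or from FaceFilter itself.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio, in dB (sketch of [150])."""
    # Remove DC so the energy ratio is not biased by offsets.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Optimal scaling of the reference: project the estimate onto the target.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = alpha * target          # scaled target component
    e_noise = estimate - s_target      # everything else counts as distortion
    return float(10 * np.log10((np.dot(s_target, s_target) + eps)
                               / (np.dot(e_noise, e_noise) + eps)))

# Toy check: a lightly perturbed copy of the target scores high, and the
# score is unchanged if the estimate is rescaled (hence "scale-invariant").
rng = np.random.default_rng(0)
clean = rng.standard_normal(16_000)
est = clean + 0.1 * rng.standard_normal(16_000)
print(round(si_sdr(est, clean), 2), round(si_sdr(3.0 * est, clean), 2))
```

The scale invariance is the design point: because the reference is rescaled by the projection before the ratio is taken, simply amplifying the output cannot inflate the score, which is why SI-SDR displaced plain SNR/SDR for blind separation benchmarks.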
“…AVS [25]: V (face) + A, 5.9
AV-BLSTM [11,25]: V (face), 3.25
FaceFilter [26]: I (face), 2.5
AV-U-Net [27]: V (face), 7.6
AV-LSTM [12]: V …
MuSE is different from the 'Looking to listen at the cocktail party' [11]. As MuSE uses speech-lip synchronization information instead of the speech-face synchronization cue, MuSE is expected to generalize well to new speakers.…”
Section: Model (mentioning)
confidence: 99%
“…Several prior works for speaker extraction have studied various cues about the target speaker, such as voiceprint [11,20,21], lip movement [12,22], facial appearance [23], and spatial information [13].…”
Section: Relation To Prior Work (mentioning)
confidence: 99%