Interspeech 2018
DOI: 10.21437/interspeech.2018-1400

The Conversation: Deep Audio-Visual Speech Enhancement

Abstract: Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. Existing works in this area have focussed on trying to separate utterances from known speakers in controlled environments. In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal. The method is applicable to speakers unheard and unseen during traini…
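The abstract describes a network that predicts both the magnitude and the phase of the target speaker's spectrogram and recombines them into an enhanced signal. The following is a minimal sketch of that recombination step only, not the paper's network: the "predicted" magnitude and phase here are faked with an oracle computed from a known toy target, and all names (`pred_magnitude`, `pred_phase`, the 440 Hz/1 kHz toy signals) are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

# Toy mixture: two sinusoids standing in for two simultaneous talkers.
fs = 16000
t = np.arange(fs) / fs
mixture = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

f, frames, S_mix = stft(mixture, fs=fs, nperseg=512)

# In the paper's setting, a network predicts the target's magnitude and phase
# from the mixture plus the speaker's lip region; here both are faked with an
# oracle computed from the known 440 Hz "target" (illustration only).
target = np.sin(2 * np.pi * 440 * t)
_, _, S_tgt = stft(target, fs=fs, nperseg=512)
pred_magnitude = np.abs(S_tgt)    # stands in for the magnitude subnetwork
pred_phase = np.angle(S_tgt)      # stands in for the phase subnetwork

# Combine predicted magnitude and phase into a complex spectrogram, then
# invert it back to a time-domain enhanced waveform.
S_hat = pred_magnitude * np.exp(1j * pred_phase)
_, enhanced = istft(S_hat, fs=fs, nperseg=512)
```

With oracle predictions, the STFT/ISTFT round trip reconstructs the target waveform almost exactly; a real system's quality is bounded by how well the two subnetworks approximate these quantities.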

Cited by 283 publications (330 citation statements)
References 40 publications
“…Recent research on leveraging visual modality has led to impressive results in speech separation. In these studies, various representations of the visual information such as lip appearance [17,16] and optical flow [18,19] are used to estimate the time-frequency (TF) mask. In this paper, the audio-visual speech separation component used in the pipelined system is based on our previous work in [21].…”
Section: Audio-visual Speech Separation
confidence: 99%
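The statement above says visual features are used to estimate a time-frequency (TF) mask. A minimal sketch of what such a mask does, assuming an ideal ratio mask (IRM) as the training target and random toy magnitudes in place of real spectrograms (`S_target`, `S_noise`, and the additive-magnitude mixture are all illustrative assumptions, not the cited systems):

```python
import numpy as np

# Hypothetical toy magnitude spectrograms: target speech and interference.
rng = np.random.default_rng(0)
S_target = rng.random((257, 100))   # |STFT| of the target talker
S_noise = rng.random((257, 100))    # |STFT| of the competing talker
S_mixture = S_target + S_noise      # crude additive-magnitude mixture

# Ideal ratio mask (IRM): one common TF-mask target that such networks are
# trained to approximate from visual cues (lip appearance, optical flow).
irm = S_target**2 / (S_target**2 + S_noise**2 + 1e-8)

# Applying the mask to the mixture attenuates interference-dominated bins
# while passing target-dominated bins nearly unchanged.
S_enhanced = irm * S_mixture
```

The mask lies in [0, 1] per TF bin, so enhancement is a bin-wise reweighting of the mixture rather than a resynthesis from scratch.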
“…Motivated by the bimodal nature of human speech perception [2,10], and the invariance of visual information to acoustic signal corruption, audio-visual speech recognition (AVSR) technologies [11,12,13,14] can also be used for overlapped speech separation [15,16,17,18,19,20,21,22] and the back-end recognition component. However, the use of visual modality in the recognition stage of system development for overlapped speech remains limited to date.…”
Section: Introduction
confidence: 99%
“…The bottleneck layer activations were used as visual deep features. We also tried using the visual features in [15,16]; however, we found that these were less effective due to the unavailability of the training data used in [15,16]. The network architectures were the same as the one in the proposed model with E2EASR features.…”
Section: Baselines
confidence: 99%
“…Audio-visual speech enhancement methods also incorporate the visual information (video frames) associated with the noisy speech, aiming to improve the quality of the enhanced speech signal [11][12][13]. Using the video modality is…”
Section: Introduction
confidence: 99%