2017 IEEE International Conference on Computer Vision Workshops (ICCVW)
DOI: 10.1109/iccvw.2017.61

Improved Speech Reconstruction from Silent Video

Abstract: Speechreading is the task of inferring phonetic information from visually observed articulatory facial movements, and is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible and natural-sounding acoustic speech signal from silent video frames of a speaking person. We train our model on speakers from the GRID and TCD-TIMIT datasets, and evaluate the quality and intelligibility of reconstructed…
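The abstract describes an end-to-end CNN that regresses acoustic features from silent video frames. As a rough, hypothetical sketch of such a video-to-spectrogram network (PyTorch, the layer sizes, and the mel-spectrogram target are all assumptions for illustration, not the paper's actual architecture):

```python
# Hypothetical sketch of a video-to-speech CNN (not the authors' exact model).
# Assumes PyTorch; layer sizes and the mel-spectrogram output are illustrative.
import torch
import torch.nn as nn

class Video2Speech(nn.Module):
    def __init__(self, n_mels=80, frames_out=4):
        super().__init__()
        # 3D convolutions over (time, height, width) of the mouth-region clip.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((1, 4, 4)),  # collapse time, shrink space
        )
        # Fully connected head regresses a short window of spectrogram frames.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 512),
            nn.ReLU(),
            nn.Linear(512, n_mels * frames_out),
        )
        self.n_mels, self.frames_out = n_mels, frames_out

    def forward(self, clip):              # clip: (batch, 3, T, H, W)
        feats = self.encoder(clip)
        out = self.head(feats)
        return out.view(-1, self.n_mels, self.frames_out)

model = Video2Speech()
clip = torch.randn(2, 3, 9, 64, 64)      # 9-frame crops around the mouth
spec = model(clip)                        # (2, 80, 4) predicted mel frames
```

In the paper itself the predicted acoustic features are further converted into a waveform; the sketch stops at the spectrogram frames.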

Cited by 81 publications (81 citation statements) | References 43 publications (69 reference statements)
“…In [25] a deep neural network is developed to generate speech from silent video frames of a speaking person. This model is used in [26] for speech enhancement, where the predicted spectrogram serves as a mask to filter the noisy speech.…”
Section: arXiv:1804.04121v2 [cs.CV] 19 Jun 2018
confidence: 99%
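The citing work [26] uses the predicted clean spectrogram as a mask over the noisy mixture. A minimal sketch of that masking step (the soft ratio-mask formula and the SciPy STFT usage are assumptions for illustration, not the exact procedure of [26]):

```python
# Minimal sketch of spectrogram masking for speech enhancement.
# Assumption: video-predicted magnitudes act as a soft ratio mask;
# this is illustrative, not the exact method of [26].
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, predicted_mag, fs=16000, nperseg=512):
    """noisy: 1-D waveform; predicted_mag: magnitude spectrogram
    aligned with (and shaped like) the STFT of `noisy`."""
    f, t, Zxx = stft(noisy, fs=fs, nperseg=nperseg)
    noisy_mag = np.abs(Zxx)
    # Soft mask in [0, 1]: how much of each time-frequency bin to keep.
    mask = np.clip(predicted_mag / (noisy_mag + 1e-8), 0.0, 1.0)
    # Attenuate the magnitudes, reuse the noisy phase.
    _, enhanced = istft(mask * Zxx, fs=fs, nperseg=nperseg)
    return enhanced
```

Reusing the noisy phase is a common simplification in mask-based enhancement, since only magnitudes are predicted from video.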
“…Several approaches exist for generation of intelligible speech from silent video frames of a person speaking [5,6,7]. In this work we rely on vid2speech [6], briefly described in Sec. 2.1.…”
Section: Visually-Derived Speech Generation
confidence: 99%
“…We continue with the isolation of the speech of a single visible speaker from background sounds. This work builds upon recent advances in machine speechreading, generating speech from visible motion of the face and mouth [5,6,7].…”
Section: Introduction
confidence: 99%
“…The LSPs are converted into waveforms, but since excitation is not predicted, the resulting speech sounds unnatural. This method is extended in [9] by adding optical flow information as input to the network and by adding a postprocessing step in which generated sound features are replaced by their closest match from the training set. A similar method that uses multi-view visual feeds has been proposed in [10].…”
Section: Introduction
confidence: 99%
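The postprocessing step attributed to [9] swaps each generated sound-feature vector for its closest match in the training set, a form of example-based smoothing. A small sketch of that nearest-neighbor lookup (the Euclidean metric and the use of scikit-learn are assumptions, not details from [9]):

```python
# Sketch of example-based postprocessing: replace each generated feature
# vector with its nearest neighbor from the training set, as described in [9].
# Euclidean distance and scikit-learn are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def closest_match(generated, training):
    """generated: (n_frames, dim) predicted features;
    training: (n_examples, dim) features from the training set."""
    nn = NearestNeighbors(n_neighbors=1).fit(training)
    _, idx = nn.kneighbors(generated)
    return training[idx[:, 0]]   # each frame swapped for its best match

train_feats = np.random.randn(1000, 32)      # e.g. spectral feature vectors
gen_feats = np.random.randn(50, 32)
post = closest_match(gen_feats, train_feats)  # (50, 32) cleaned features
```

Because every output frame is drawn from real training examples, the resulting features avoid the over-smoothed quality of raw network predictions.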