2021 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv48630.2021.00197
Visual Speech Enhancement Without A Real Visual Stream

Cited by 10 publications (3 citation statements)
References 22 publications
“…Such information would be learned by the GAN during training and can be beneficial in speech-in-noise tasks. The latter conclusion is supported by the results presented by Hegde et al. (2021), who recently showed that hallucinating a visual stream by generating it from the audio input can help reduce background noise and increase speech intelligibility. Importantly, they also showed that human scores on subjective scales such as quality and intelligibility were higher for speech denoised in this way.…”
Section: Discussion
confidence: 59%
“…short in providing complementary information to enhance AO models, as our method does. Compared with a previous work (Hegde et al. 2021) similar to ours, in which pseudo videos are generated from noisy audio by an enhanced speech2lip model to facilitate speech enhancement, our approach achieves 63% (7.1% → 2.6%) and 66.7% (9.3% → 3.1%) relative WER reduction. We attribute this notable performance improvement to two factors: 1) sequence-to-sequence generation in a discrete space is easier to achieve than generating accurate pseudo videos with hundreds of frames in sync with the audio.…”
Section: Evaluation and Analysis
confidence: 74%
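The relative WER reductions quoted in the statement above follow from simple arithmetic; a minimal check (function name is illustrative, not from either paper):

```python
def relative_wer_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction in word error rate relative to the baseline."""
    return (baseline - improved) / baseline

# The two figures quoted in the citation statement:
print(f"{relative_wer_reduction(7.1, 2.6):.0%}")   # 7.1% -> 2.6%, i.e. 63%
print(f"{relative_wer_reduction(9.3, 3.1):.1%}")   # 9.3% -> 3.1%, i.e. 66.7%
```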
“…However, a major limitation of this method is that it disregards the high-level semantic relationships between the audio and visual modalities, so the generated pseudo videos have low information density. An additional visual encoder is therefore required in (Hegde et al. 2021) to extract semantic features that correlate more strongly with the speech content.…”
Section: Introduction
confidence: 99%
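The data flow described in these citation statements — noisy audio drives a speech-to-lip generator, the pseudo visual stream passes through an additional visual encoder, and the two streams are fused for enhancement — can be sketched schematically. This is only a shape-level illustration: every module below is a stand-in linear map, and all names and dimensions are hypothetical, not taken from either paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (frames, audio feats, visual feats, hidden size);
# none of these values come from the papers.
T, A_DIM, V_DIM, H_DIM = 100, 80, 512, 256

def pseudo_video_generator(audio: np.ndarray) -> np.ndarray:
    """Stand-in for the speech-to-lip generator: maps noisy audio
    features to a synthetic visual-feature stream."""
    W = rng.standard_normal((A_DIM, V_DIM)) * 0.01
    return audio @ W

def visual_encoder(video_feats: np.ndarray) -> np.ndarray:
    """Stand-in for the extra visual encoder that extracts
    speech-correlated semantics from the pseudo video."""
    W = rng.standard_normal((V_DIM, H_DIM)) * 0.01
    return np.tanh(video_feats @ W)

def av_enhancer(audio: np.ndarray, visual: np.ndarray) -> np.ndarray:
    """Stand-in audio-visual enhancer: fuses both streams and predicts
    an enhanced feature sequence with the input audio's shape."""
    W = rng.standard_normal((A_DIM + H_DIM, A_DIM)) * 0.01
    return np.concatenate([audio, visual], axis=-1) @ W

noisy_audio = rng.standard_normal((T, A_DIM))
pseudo_video = pseudo_video_generator(noisy_audio)
enhanced = av_enhancer(noisy_audio, visual_encoder(pseudo_video))
print(enhanced.shape)  # (100, 80)
```

The point of the sketch is the pipeline topology, not the models: the visual stream is hallucinated from the audio itself, so no real video input is needed at inference time.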