Visual Speech Enhancement Without A Real Visual Stream

Hegde, Sindhu B; Prajwal, K R; Mukhopadhyay, Rudrabha; Namboodiri, Vinay P.; Jawahar, C. V.

doi:10.1109/wacv48630.2021.00197

Cited by 10 publications

(3 citation statements)

References 22 publications


“…Such information would be learned by the GAN during training and can be beneficial in speech-in-noise tasks. The latter conclusion is supported by the results presented by Hegde et al (2021), who recently showed that hallucinating a visual stream by generating it from the audio input can aid to reduce background noise and increase speech intelligibility. Importantly, they also showed that humans scores on subjective scales such as quality and intelligibility were higher for speech denoised in such a way.…”

Section: Discussion (mentioning)

confidence: 59%

Speech-Driven Facial Animations Improve Speech-in-Noise Comprehension of Humans

Varano, Vougioukas, et al. 2022, Front. Neurosci.


Understanding speech becomes a demanding task when the environment is noisy. Comprehension of speech in noise can be substantially improved by looking at the speaker’s face, and this audiovisual benefit is even more pronounced in people with hearing impairment. Recent advances in AI have allowed to synthesize photorealistic talking faces from a speech recording and a still image of a person’s face in an end-to-end manner. However, it has remained unknown whether such facial animations improve speech-in-noise comprehension. Here we consider facial animations produced by a recently introduced generative adversarial network (GAN), and show that humans cannot distinguish between the synthesized and the natural videos. Importantly, we then show that the end-to-end synthesized videos significantly aid humans in understanding speech in noise, although the natural facial motions yield a yet higher audiovisual benefit. We further find that an audiovisual speech recognizer (AVSR) benefits from the synthesized facial animations as well. Our results suggest that synthesizing facial motions from speech can be used to aid speech comprehension in difficult listening environments.


“…short in providing complementary information to enhance AO models as our method. For a previous work (Hegde et al 2021) similar to ours, where pseudo videos are generated from noisy audio by an enhanced speech2lip model to facilitate speech enhancement, our approach achieves 63% (7.1% → 2.6%) and 66.7% (9.3% → 3.1%) relative WER reduction. We attribute this notable performance enhancement to the following two factors: 1) Compared to generating accurate pseudo videos with hundreds of frames in sync with audio, sequence-to-sequence generation based on discrete space is easier to achieve.…”

Section: Evaluation and Analysis (mentioning)

confidence: 74%
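The relative WER reductions quoted in the statement above can be checked with a short calculation. This is a minimal sketch, assuming the usual definition of relative reduction, (baseline - improved) / baseline; the function name `relative_reduction` is hypothetical and does not come from either paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction from baseline to improved, in percent.

    Assumes the common definition (baseline - improved) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline * 100.0


# WER pairs (in percent) as quoted in the citation statement.
first = relative_reduction(7.1, 2.6)
second = relative_reduction(9.3, 3.1)

print(f"{first:.1f}% and {second:.1f}%")
```

Under this definition the two pairs come out at roughly 63% and 66.7%, which agrees with the figures in the quoted statement.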

“…However, a major limitation of this method is its disregard of the high-level semantic relationships between the audio and visual modalities, resulting in the generation of pseudo videos with low information density. Therefore, an additional visual encoder is required in (Hegde et al 2021) to extract the semantic features with a higher correlation to the speech content.…”

Section: Introduction (mentioning)

confidence: 99%

Visual Hallucination Elevates Speech Recognition

Zhang, Zhu, Wang, et al. 2024, AAAI


Due to the detrimental impact of noise on the conventional audio speech recognition (ASR) task, audio-visual speech recognition (AVSR) has been proposed by incorporating both audio and visual video signals. Although existing methods have demonstrated that the aligned visual input of lip movements can enhance the robustness of AVSR systems against noise, the paired videos are not always available during inference, leading to the problem of the missing visual modality, which restricts their practicality in real-world scenarios. To tackle this problem, we propose a Discrete Feature based Visual Generative Model (DFVGM) which exploits semantic correspondences between the audio and visual modalities during training, generating visual hallucinations in lieu of real videos during inference. To achieve that, the primary challenge is to generate the visual hallucination given the noisy audio while preserving semantic correspondences with the clean speech. To tackle this challenge, we start with training the audio encoder in the Audio-Only (AO) setting, which generates continuous semantic features closely associated with the linguistic information. Simultaneously, the visual encoder is trained in the Visual-Only (VO) setting, producing visual features that are phonetically related. Next, we employ K-means to discretize the continuous audio and visual feature spaces. The discretization step allows DFVGM to capture high-level semantic structures that are more resilient to noise and generate visual hallucinations with high quality. To evaluate the effectiveness and robustness of our approach, we conduct extensive experiments on two publicly available datasets. The results demonstrate that our method achieves a remarkable 53% relative reduction (30.5% → 12.9%) in Word Error Rate (WER) on average compared to the current state-of-the-art Audio-Only (AO) baselines while maintaining comparable results (< 5% difference) under the Audio-Visual (AV) setting even without video as input.
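The K-means discretization step mentioned in the abstract can be illustrated with a toy example. This is only a sketch of the general idea (cluster continuous feature vectors, then replace each frame by the index of its nearest centroid) and not the DFVGM implementation; the 2-D features, the choice of k = 2, and all function names are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, point):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(centroids[i], point))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(centroids, p)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return centroids


def discretize(frames, centroids):
    """Map each continuous frame to a discrete token (its cluster index)."""
    return [nearest(centroids, f) for f in frames]


# Hypothetical 2-D "features": two well-separated groups of frames.
frames = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centroids = kmeans(frames, k=2)
tokens = discretize(frames, centroids)
print(tokens)
```

Frames from the same group receive the same token, so a small perturbation of a frame usually leaves its token unchanged. That is one plausible reading of why the abstract describes the discrete representation as more resilient to noise.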
