2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01524

VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

Cited by 110 publications (44 citation statements)
References 51 publications

“…Several AVSR architectures have been proposed [4,10,17,13,16,22,23] which show that the improvement over ASR models is greater as the noise level increases, i.e., the SNR is lower. The same VSR architectures can also be used to improve the performance of audio-based models in a variety of applications like speech enhancement [24], speech separation [25,26], voice activity detection [27], active speaker detection [28] and speaker diarisation [29].…”
Section: Applications (mentioning)
confidence: 99%
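The excerpt above compares audio-visual and audio-only models across noise levels measured by signal-to-noise ratio (SNR). As a minimal, hypothetical illustration of that evaluation condition (not code from VisualVoice or any cited paper), the sketch below mixes a clean waveform with noise at a chosen SNR; the function name `mix_at_snr` and the synthetic signals are assumptions made for demonstration.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add."""
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise
    # SNR(dB) = 10 * log10(P_clean / P_noise)  =>  solve for the noise gain.
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

# Example: a -5 dB mixture, i.e. the noise is louder than the speech.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for 1 s of clean speech at 16 kHz
noise = rng.standard_normal(16000)
mixture = mix_at_snr(speech, noise, snr_db=-5.0)
```

Lower `snr_db` values produce noisier mixtures, which is the regime in which the cited works report the largest gains over audio-only ASR.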
“…Instead, we feed in a magnitude spectrogram, predict the target magnitude spectrograms, and generate the time-domain signals with Griffin Lim [25]. This baseline helps isolate the impact of our proposed cross-modal attention architecture compared to the common U-Net approach [15,22,23,46,71].…”
Section: Methods (mentioning)
confidence: 99%
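The baseline in this excerpt predicts target magnitude spectrograms and resynthesizes audio with Griffin-Lim instead of reusing the mixture phase. Below is a minimal sketch of that reconstruction step, assuming a mask-style network output; `predicted_mask`, the STFT parameters, and the use of librosa are illustrative choices, not details taken from the cited papers.

```python
import numpy as np
import librosa

def separate_with_griffinlim(mixture: np.ndarray, predicted_mask: np.ndarray,
                             n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
    """Apply a predicted magnitude mask and rebuild a waveform with Griffin-Lim."""
    # Magnitude spectrogram of the input mixture (the mixture phase is discarded).
    mixture_mag = np.abs(librosa.stft(mixture, n_fft=n_fft, hop_length=hop_length))
    # `predicted_mask` stands in for the separation network's output
    # (same shape as `mixture_mag`, values in [0, 1]).
    target_mag = mixture_mag * predicted_mask
    # Griffin-Lim iteratively estimates a phase consistent with the target magnitude.
    return librosa.griffinlim(target_mag, n_iter=60,
                              n_fft=n_fft, hop_length=hop_length)

# Example with random stand-ins for a real mixture and a real network prediction.
mix = np.random.default_rng(0).standard_normal(16000).astype(np.float32)
mask = np.ones((1 + 1024 // 2, 1 + 16000 // 256), dtype=np.float32)  # all-pass mask
estimate = separate_with_griffinlim(mix, mask)
```

Because Griffin-Lim estimates phase purely from magnitude consistency, this setup attributes separation quality to the predicted magnitudes alone, which is why the citing authors use it to isolate their architectural contribution.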
“…Multimodal fusion. One standard solution for audiovisual feature fusion is to represent audio as spectrograms, a matrix representation of the spectrum of frequencies of a signal as it varies with time, process them with a CNN, and concatenate with visual features from another CNN [12,18,22,23,46]. This fusion strategy is limited by using one global feature to represent the scene and thus supports only coarse-grained reasoning.…”
Section: Related Work (mentioning)
confidence: 99%
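The fusion strategy described here (a spectrogram CNN and a visual CNN whose pooled global features are concatenated) can be sketched briefly. The layer sizes, module names, and the PyTorch framing below are illustrative assumptions rather than any cited paper's architecture; the sketch mainly shows how pooling each modality to one global feature yields only coarse, scene-level fusion.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Toy concatenation-based audio-visual fusion: one global feature per modality."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.audio_cnn = nn.Sequential(                  # input: (B, 1, F, T) spectrogram
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))                      # collapse to one audio feature
        self.visual_cnn = nn.Sequential(                  # input: (B, 3, H, W) face frame
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))                      # collapse to one visual feature
        self.head = nn.Linear(2 * feat_dim, feat_dim)     # fused representation

    def forward(self, spec: torch.Tensor, frame: torch.Tensor) -> torch.Tensor:
        a = self.audio_cnn(spec).flatten(1)               # (B, feat_dim)
        v = self.visual_cnn(frame).flatten(1)             # (B, feat_dim)
        return self.head(torch.cat([a, v], dim=1))        # coarse, scene-level fusion

# Example: batch of 2 spectrograms (513 bins x 100 frames) and 2 face frames.
fused = ConcatFusion()(torch.randn(2, 1, 513, 100), torch.randn(2, 3, 224, 224))
```

The global pooling step is exactly what the citing text criticizes: all time-frequency and spatial detail is averaged away before fusion, so the combined feature supports only coarse-grained reasoning about the scene.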