2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01524

VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

Cited by 110 publications (44 citation statements)
References 51 publications

“…Several AVSR architectures have been proposed [4,10,17,13,16,22,23] which show that the improvement over ASR models is greater as the noise level increases, i.e., the SNR is lower. The same VSR architectures can also be used to improve the performance of audio-based models in a variety of applications like speech enhancement [24], speech separation [25,26], voice activity detection [27], active speaker detection [28] and speaker diarisation [29].…”
Section: Applications (mentioning)
confidence: 99%
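The excerpt above compares audio-visual and audio-only models across noise levels measured by signal-to-noise ratio (SNR). As a minimal, hypothetical illustration of that evaluation condition (not code from VisualVoice or any cited paper), the sketch below mixes a clean waveform with noise at a chosen SNR; the function name `mix_at_snr` and the synthetic signals are assumptions made for demonstration.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add."""
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise
    # SNR(dB) = 10 * log10(P_clean / P_noise)  =>  solve for the noise gain.
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

# Example: a -5 dB mixture, i.e. the noise is louder than the speech.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for 1 s of clean speech at 16 kHz
noise = rng.standard_normal(16000)
mixture = mix_at_snr(speech, noise, snr_db=-5.0)
```

Lower `snr_db` values produce noisier mixtures, which is the regime in which the cited works report the largest gains over audio-only ASR.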
“…Instead, we feed in a magnitude spectrogram, predict the target magnitude spectrograms, and generate the time-domain signals with Griffin Lim [25]. This baseline helps isolate the impact of our proposed cross-modal attention architecture compared to the common U-Net approach [15,22,23,46,71].…”
Section: Methods (mentioning)
confidence: 99%
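The baseline in this excerpt predicts target magnitude spectrograms and resynthesizes audio with Griffin-Lim instead of reusing the mixture phase. Below is a minimal sketch of that reconstruction step, assuming a mask-style network output; `predicted_mask`, the STFT parameters, and the use of librosa are illustrative choices, not details taken from the cited papers.

```python
import numpy as np
import librosa

def separate_with_griffinlim(mixture: np.ndarray, predicted_mask: np.ndarray,
                             n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
    """Apply a predicted magnitude mask and rebuild a waveform with Griffin-Lim."""
    # Magnitude spectrogram of the input mixture (the mixture phase is discarded).
    mixture_mag = np.abs(librosa.stft(mixture, n_fft=n_fft, hop_length=hop_length))
    # `predicted_mask` stands in for the separation network's output
    # (same shape as `mixture_mag`, values in [0, 1]).
    target_mag = mixture_mag * predicted_mask
    # Griffin-Lim iteratively estimates a phase consistent with the target magnitude.
    return librosa.griffinlim(target_mag, n_iter=60,
                              n_fft=n_fft, hop_length=hop_length)

# Example with random stand-ins for a real mixture and a real network prediction.
mix = np.random.default_rng(0).standard_normal(16000).astype(np.float32)
mask = np.ones((1 + 1024 // 2, 1 + 16000 // 256), dtype=np.float32)  # all-pass mask
estimate = separate_with_griffinlim(mix, mask)
```

Because Griffin-Lim estimates phase purely from magnitude consistency, this setup attributes separation quality to the predicted magnitudes alone, which is why the citing authors use it to isolate their architectural contribution.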
“…Multimodal fusion. One standard solution for audiovisual feature fusion is to represent audio as spectrograms, a matrix representation of the spectrum of frequencies of a signal as it varies with time, process them with a CNN, and concatenate with visual features from another CNN [12,18,22,23,46]. This fusion strategy is limited by using one global feature to represent the scene and thus supports only coarse-grained reasoning.…”
Section: Related Work (mentioning)
confidence: 99%
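The fusion strategy described here (a spectrogram CNN and a visual CNN whose pooled global features are concatenated) can be sketched briefly. The layer sizes, module names, and the PyTorch framing below are illustrative assumptions rather than any cited paper's architecture; the sketch mainly shows how pooling each modality to one global feature yields only coarse, scene-level fusion.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Toy concatenation-based audio-visual fusion: one global feature per modality."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.audio_cnn = nn.Sequential(                  # input: (B, 1, F, T) spectrogram
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))                      # collapse to one audio feature
        self.visual_cnn = nn.Sequential(                  # input: (B, 3, H, W) face frame
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))                      # collapse to one visual feature
        self.head = nn.Linear(2 * feat_dim, feat_dim)     # fused representation

    def forward(self, spec: torch.Tensor, frame: torch.Tensor) -> torch.Tensor:
        a = self.audio_cnn(spec).flatten(1)               # (B, feat_dim)
        v = self.visual_cnn(frame).flatten(1)             # (B, feat_dim)
        return self.head(torch.cat([a, v], dim=1))        # coarse, scene-level fusion

# Example: batch of 2 spectrograms (513 bins x 100 frames) and 2 face frames.
fused = ConcatFusion()(torch.randn(2, 1, 513, 100), torch.randn(2, 3, 224, 224))
```

The global pooling step is exactly what the citing text criticizes: all time-frequency and spatial detail is averaged away before fusion, so the combined feature supports only coarse-grained reasoning about the scene.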