ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9413431
Improving Multimodal Speech Enhancement by Incorporating Self-Supervised and Curriculum Learning

Abstract: Speech enhancement in realistic scenarios still faces many challenges, such as complex background signals and limited data. In this paper, we present a co-attention based framework that incorporates self-supervised and curriculum learning to derive the target speech in noisy environments. Specifically, we first leverage self-supervision to pre-train the co-attention model on the task of audio-visual synchronization. The pre-trained model can focus on the lips of speakers automatically, and then the self-sup…
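The self-supervised pre-training described above treats audio-visual synchronization as a pretext task: aligned audio/lip pairs should score higher than temporally shifted ones. A minimal sketch of such a contrastive sync objective, assuming precomputed audio and video embeddings (the function name and embedding shapes are illustrative, not from the paper):

```python
import numpy as np

def sync_contrastive_loss(audio_emb, video_emb, offset_emb):
    """Audio-visual sync pretext loss: the cosine score of a temporally
    aligned (audio, video) pair should exceed that of a misaligned
    (audio, offset video) pair. Returns a logistic contrastive loss."""
    def score(a, v):
        return float(a @ v / (np.linalg.norm(a) * np.linalg.norm(v)))
    pos = score(audio_emb, video_emb)   # aligned pair
    neg = score(audio_emb, offset_emb)  # shifted (negative) pair
    # Logistic loss pushing the positive score above the negative one.
    return float(np.log1p(np.exp(neg - pos)))

# Toy usage: identical embeddings are "in sync", orthogonal ones are not.
aligned = np.array([1.0, 0.0])
shifted = np.array([0.0, 1.0])
print(round(sync_contrastive_loss(aligned, aligned, shifted), 3))  # → 0.313
```

In practice the negatives come from shifting the video (or audio) stream by a random temporal offset within the same clip, so the model must attend to lip motion rather than identity cues.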

Cited by 3 publications (1 citation statement) · References 25 publications
“…However, in real-world settings, the speech recordings are degraded with additive noise; thus, self-learning robust embeddings becomes particularly challenging and demands adaptation to the input noise distribution [20]. Several unsupervised speech denoising algorithms have been proposed by identifying and training with relatively clean segments of the noisy speech mixture [21], [22], using ASR losses [23], [24], exploiting visual cues [25], and harnessing the spatial separability of the sources using mic-arrays [26], [27]. Mixture invariant training (MixIT) [28] enables unsupervised training of separation models with only real-world single-channel recordings by generating artificial mixtures of mixtures and estimating the independent sources.…”
mentioning
confidence: 99%
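The MixIT idea mentioned in the citation above can be sketched concretely: sum two real mixtures into a "mixture of mixtures", let a separator output several sources, and score the best assignment of those sources back to the two original mixtures. The function below is a minimal illustrative sketch of that loss, not the implementation from [28]; the model is stubbed out by feeding in the true sources directly.

```python
import itertools
import numpy as np

def mixit_loss(est_sources, ref_mixtures):
    """MixIT loss sketch.
    est_sources: (M, T) array of sources estimated from the summed mixture.
    ref_mixtures: (2, T) array of the two real-world mixtures that were summed.
    Returns the minimum MSE over all assignments of sources to mixtures."""
    M = est_sources.shape[0]
    best = np.inf
    # Each estimated source is assigned to exactly one reference mixture:
    # enumerate all 2**M binary mixing matrices A in {0,1}^{2 x M}.
    for assignment in itertools.product([0, 1], repeat=M):
        A = np.zeros((2, M))
        A[assignment, range(M)] = 1.0
        remix = A @ est_sources              # (2, T) remixed estimates
        err = np.mean((remix - ref_mixtures) ** 2)
        best = min(best, err)
    return float(best)

# Toy usage: two "real" mixtures built from three latent sources;
# a perfect separator (est = true sources) drives the loss to zero.
rng = np.random.default_rng(0)
s = rng.standard_normal((3, 8))
x1, x2 = s[0] + s[1], s[2]
est = s.copy()
print(mixit_loss(est, np.stack([x1, x2])) < 1e-12)  # → True
```

Because the assignment search is combinatorial (2**M), real implementations keep M small or solve the remixing as a least-squares step; this exhaustive version is only meant to show the objective.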