2020
DOI: 10.48550/arxiv.2006.10388
Preprint

Self-supervised Learning for Speech Enhancement

Yu-Che Wang, Shrikant Venkataramani, Paris Smaragdis

Abstract: Supervised learning for single-channel speech enhancement requires carefully labeled training examples, where the noisy mixture is input to the network and the network is trained to produce an output close to the ideal target. To relax the conditions on the training data, we consider the task of training speech enhancement networks in a self-supervised manner. We first use a limited training set of clean speech sounds and learn a latent representation by autoencoding on their magnitude spectrograms. We then au…
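
The first stage the abstract describes, autoencoding magnitude spectrograms of clean speech to learn a latent representation, can be sketched as below. This is an illustrative reconstruction: the STFT settings, layer sizes, and PyTorch framing are assumptions, not the authors' exact architecture.

```python
# Illustrative sketch only: autoencode magnitude spectrograms of clean
# speech to learn a latent representation. STFT settings and layer sizes
# are assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

N_FFT = 1024                      # assumed STFT size
N_BINS = N_FFT // 2 + 1

class SpectrogramAutoencoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(N_BINS, 512), nn.ReLU(),
            nn.Linear(512, latent_dim), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, N_BINS), nn.Softplus())  # non-negative magnitudes

    def forward(self, mag):       # mag: (batch, frames, N_BINS)
        z = self.encoder(mag)     # per-frame latent representation
        return self.decoder(z), z

def magnitude_spectrogram(wav):
    """Magnitude STFT, shape (batch, frames, N_BINS)."""
    spec = torch.stft(wav, N_FFT, hop_length=N_FFT // 4,
                      window=torch.hann_window(N_FFT), return_complex=True)
    return spec.abs().transpose(1, 2)

# One training step on clean speech (reconstruction loss).
model = SpectrogramAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
clean = torch.randn(4, 16000)     # stand-in for a batch of clean audio
mag = magnitude_spectrogram(clean)
recon, _ = model(mag)
loss = nn.functional.mse_loss(recon, mag)
opt.zero_grad()
loss.backward()
opt.step()
```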

Cited by 11 publications (11 citation statements)
References 15 publications
“…Towards speech separation, a model may be trained in an unsupervised, permutation-invariant way by indefinitely mixing mixtures and separating them into an arbitrary number of sources while minimizing a signal-to-noise ratio (SNR) loss (Wisdom et al., 2020). A recent work applies self-supervised learning directly to speech enhancement: the authors train an autoencoder on unlabeled noisy data, coupling its parameters with another autoencoder pretrained on clean data (Wang et al., 2020b). In all prior works, the exact amount of data augmentation is left unquantified; more specifically, minimizing the required amount of speaker-specific clean speech data is not addressed.…”
Section: Related Work
confidence: 99%
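
For reference, a hedged sketch of the negative-SNR objective and a permutation-invariant wrapper of the kind this statement refers to. The epsilon stabilization is an assumption, and Wisdom et al.'s mixture-invariant (MixIT) training additionally remixes the separated sources, which is omitted here.

```python
# Sketch of a negative-SNR loss with a permutation-invariant wrapper.
# The epsilon terms are assumptions; MixIT's remixing step is omitted.
from itertools import permutations
import torch

def neg_snr(est, ref, eps=1e-8):
    """Negative SNR in dB, per batch element; est, ref: (batch, samples)."""
    num = (ref ** 2).sum(dim=-1)
    den = ((ref - est) ** 2).sum(dim=-1) + eps
    return -10.0 * torch.log10(num / den + eps)

def pit_neg_snr_loss(est, ref):
    """Best source ordering per batch; est, ref: (batch, n_src, samples)."""
    n_src = ref.shape[1]
    per_perm = torch.stack([
        sum(neg_snr(est[:, p[i]], ref[:, i]) for i in range(n_src))
        for p in permutations(range(n_src))])   # (n_perms, batch)
    return per_perm.min(dim=0).values.mean()
```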
“…Instead, a self-supervised learning approach may be better suited; this works by optimizing the model on a pretext task that proxies the intended task [19]. This paradigm has seen extensive use in computer vision research [20,21], and recent studies have applied the concept to speaker-agnostic speech enhancement [22]; our paper investigates self-supervised learning specifically for speaker-specific, and thus personalized, speech enhancement.…”
Section: Introduction
confidence: 99%
“…Although it is tempting to calculate the loss through a no-reference speech quality prediction network [5], experiments have shown that DNNs may over-optimize one perceptual metric without necessarily improving others [6,7], let alone a prediction of them. Wang et al. used a pair of generative adversarial networks to map speech signals from noisy to clean [8]. The trained generator is then used to generate a pseudo reference signal.…”
Section: Introduction
confidence: 99%
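
A hedged sketch of the pseudo-reference idea this statement describes: the trained generator supplies the training target when no clean reference exists. The names `G` and `enhancer` are hypothetical placeholders, and the MSE choice is an assumption, not the cited paper's code.

```python
# Hypothetical sketch: use a pretrained noisy-to-clean generator G to
# produce a pseudo reference for training an enhancement model.
import torch
import torch.nn.functional as F

def pseudo_reference_loss(enhancer, G, noisy):
    with torch.no_grad():
        pseudo_clean = G(noisy)   # pseudo reference from the trained generator
    return F.mse_loss(enhancer(noisy), pseudo_clean)
```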
“…These studies were inspired by unpaired image-to-image translation through cycle-consistency constraints [10]. However, because [8] uses multiple encoders, its cycle-consistency constraint did not force clean speech embeddings and degraded speech embeddings to share the same latent space.…”
Section: Introduction
confidence: 99%
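
A minimal sketch of a CycleGAN-style cycle-consistency term for unpaired noisy-to-clean mapping, of the kind these constraints impose. The mapping networks `g_n2c` and `g_c2n` are assumed modules, and the L1 formulation is an assumed choice.

```python
# Sketch of a cycle-consistency term for unpaired noisy<->clean mapping.
# g_n2c and g_c2n are assumed nn.Module mappings; L1 is an assumed choice.
import torch.nn.functional as F

def cycle_consistency_loss(g_n2c, g_c2n, noisy, clean):
    """Round trips through both mappings should reconstruct the inputs."""
    return (F.l1_loss(g_c2n(g_n2c(noisy)), noisy) +
            F.l1_loss(g_n2c(g_c2n(clean)), clean))
```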