2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
DOI: 10.1109/waspaa52581.2021.9632783
Separate But Together: Unsupervised Federated Learning for Speech Enhancement from Non-IID Data

Abstract: We propose FEDENHANCE, an unsupervised federated learning (FL) approach for speech enhancement and separation with non-IID distributed data across multiple clients. We simulate a real-world scenario where each client only has access to a few noisy recordings from a limited and disjoint number of speakers (hence non-IID). Each client trains their model in isolation using mixture invariant training while periodically providing updates to a central server. Our experiments show that our approach achieves competitive…

Cited by 14 publications (9 citation statements); References 31 publications.
“…If the noise sources are independent of each other and of the clean speech component, then the model can learn to minimize this loss by reconstructing the mixture using its first estimated slot and either one of the two noise slots available. Although MixIT has proven effective for various simulated speech enhancement setups [29], [30], the assumption of having access to a diverse set of in-domain noise recordings from D n which aptly captures the true distribution of the present background noises D * n makes it impractical for many real-world settings. To this end, other works [24], [31] have tried to deal with the distribution shift between the on-hand noise dataset D n and the actual noise distribution D * n in order to avoid the need for in-domain noise samples.…”
Section: B Mixture Invariant Training (Mixit)mentioning
confidence: 99%
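The excerpt above describes the core of mixture invariant training: the model separates a mixture of mixtures into several source slots, and the loss is the best reconstruction error over all ways of assigning the estimated slots back to the two input mixtures. A minimal sketch of that assignment search (using a plain MSE here; the original papers use an SNR-based loss, and all names below are illustrative):

```python
import itertools
import numpy as np

def mixit_loss(x1, x2, est_sources):
    """Minimal MixIT loss sketch.

    x1, x2: the two input mixtures, shape (T,).
    est_sources: model estimates, shape (M, T).
    Searches over all binary assignments of the M estimated
    sources to the two mixtures and returns the lowest
    reconstruction error.
    """
    M, T = est_sources.shape
    best = np.inf
    # Each estimated source is assigned to exactly one mixture.
    for assign in itertools.product([0, 1], repeat=M):
        mix_hat = np.zeros((2, T))
        for m, a in enumerate(assign):
            mix_hat[a] += est_sources[m]
        err = (np.mean((mix_hat[0] - x1) ** 2)
               + np.mean((mix_hat[1] - x2) ** 2))
        best = min(best, err)
    return best
```

If the estimated slots exactly recover the underlying sources, some assignment reconstructs both mixtures and the loss is zero — which is also why, as the excerpt notes, a speech slot plus either noise slot can explain a noisy-speech mixture.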
“…The clean speech samples are drawn from the LibriSpeech [51] corpus and the noise recordings are taken from FSD50K [52] representing a set of almost 200 classes of background noises after excluding all the human-made sounds from the AudioSet ontology [53]. A detailed recipe of the dataset generation process is presented in [30]. LFSD becomes an ideal candidate for semi-supervised/SSL teacher pre-training on OOD data given its mixture diversity.…”
Section: Dns-challenge (Dns)mentioning
confidence: 99%
“…We sample the clips in the development set of FSD50K to simulate the noises for training and validation, and those in the evaluation set to simulate the noises for testing. Since our task is single-speaker speech enhancement, following [50] we filter out clips containing any sounds produced by humans, based on the provided sound event annotation of each clip. Such clips have annotations such as Human voice, Male speech and man speaking, Chuckle and chortle, Yell, etc.…”
Section: A Dataset For Noisy-reverberant Speech Enhancementmentioning
confidence: 99%
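The filtering step described above can be sketched as a simple set intersection against a blocklist of human-sound classes. The label names below follow the FSD50K/AudioSet vocabulary, but the exact blocklist and the clip data format here are illustrative assumptions, not the authors' code:

```python
# Hypothetical blocklist: in practice one would enumerate every
# FSD50K class under the AudioSet "Human sounds" branch.
HUMAN_SOUND_LABELS = {
    "Human_voice",
    "Male_speech_and_man_speaking",
    "Chuckle_and_chortle",
    "Yell",
}

def filter_noise_clips(clips):
    """Keep only clips annotated with no human-made sound.

    clips: dict mapping clip id -> set of sound-event labels.
    Returns the subset of clips whose label set is disjoint
    from the human-sound blocklist.
    """
    return {cid: labels
            for cid, labels in clips.items()
            if not (labels & HUMAN_SOUND_LABELS)}
```

Because FSD50K clips are multi-labeled, checking for an empty intersection (rather than a single label match) ensures a clip is dropped even when the human sound is only one of several annotated events.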