Interspeech 2021
DOI: 10.21437/interspeech.2021-1868
Personalized Speech Enhancement Through Self-Supervised Data Augmentation and Purification

Cited by 20 publications (10 citation statements). References: 0 publications.
“…The model, trained with either contrastive or non-contrastive SSL, recovers premixture sources rather than clean speech, and hence requires fine-tuning for the downstream task. Data purification (DP) [126] is later introduced into the pseudo speech enhancement training. Specifically, a separate model is trained to estimate the segmental SNR of the premixture signals, measuring the varying importance of the audio frames.…”
Section: Model (mentioning, confidence: 99%)
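The data-purification idea lends itself to a frame-weighted training loss. Below is a minimal PyTorch sketch of that weighting, assuming a sigmoid mapping from estimated segmental SNR (in dB) to frame weights and an L1 spectrogram loss; the function name, tensor shapes, and the alpha sharpness parameter are illustrative assumptions, not the exact formulation of [126].

```python
import torch

def purified_loss(pred, premixture, est_seg_snr, alpha=1.0):
    """Frame-weighted L1 loss for pseudo speech enhancement (illustrative sketch).

    pred, premixture: (batch, frames, freq) magnitude spectrograms.
    est_seg_snr:      (batch, frames) segmental-SNR estimates in dB from a
                      separately trained predictor; higher means a cleaner frame.
    """
    # Map SNR estimates to (0, 1) frame weights; alpha controls sharpness
    # (assumed mapping, not taken from the cited paper).
    weights = torch.sigmoid(alpha * est_seg_snr)            # (batch, frames)
    per_frame = (pred - premixture).abs().mean(dim=-1)      # (batch, frames)
    # Down-weight frames that the purifier flags as heavily contaminated.
    return (weights * per_frame).sum() / weights.sum().clamp_min(1e-8)
```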
“…We define PSE as enhancing only the target speaker's voice while removing any other sounds present in the input signal, including interfering speakers and environmental noises. This definition differs from that of [14], which defines PSE as fine-tuning an unconditional SE model on the target speaker's data without explicitly removing other speakers. The closest work in the literature is [6], which proposed Personalized PercepNet, a model that runs in real time and can remove background speakers in addition to noise.…”
Section: Related Work (mentioning, confidence: 99%)
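One common way to realize this definition of PSE is to condition an enhancement network on a target-speaker embedding. The sketch below is a hypothetical speaker-conditioned masking model, assuming GRU layers and magnitude-spectrogram inputs; it is not the architecture of [6] or [14], only an illustration of the conditioning idea.

```python
import torch
import torch.nn as nn

class ConditionalPSE(nn.Module):
    """Hypothetical speaker-conditioned masking network for PSE."""

    def __init__(self, n_freq=257, emb_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_freq + emb_dim, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag, spk_emb):
        """noisy_mag: (batch, frames, n_freq); spk_emb: (batch, emb_dim)."""
        # Broadcast the target-speaker embedding to every frame so the mask
        # keeps that speaker and suppresses interfering speakers and noise.
        cond = spk_emb.unsqueeze(1).expand(-1, noisy_mag.size(1), -1)
        h, _ = self.rnn(torch.cat([noisy_mag, cond], dim=-1))
        return self.mask(h) * noisy_mag
```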
“…In mixture invariant training (MixIT) [18], separation models are trained on mixtures of mixtures in both unsupervised and semi-supervised setups. An alternative approach to leveraging noisy mixtures is to train the network with pseudo-labels assigned by a pre-trained teacher model, with promising results in singing-voice separation [19,20] and speech enhancement [21,22]. Meanwhile, training with noisy data in TSE is yet to be explored.…”
Section: Introduction (mentioning, confidence: 99%)
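For the pseudo-label alternative, a single teacher-student training step on unlabeled noisy mixtures might look like the sketch below, assuming a frozen pre-trained teacher that provides pseudo-clean targets and an L1 loss; the function name and loss choice are assumptions rather than the exact procedures of [19-22].

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(student, teacher, noisy_batch, optimizer):
    """One illustrative training step on unlabeled noisy mixtures."""
    teacher.eval()
    with torch.no_grad():
        # The frozen, pre-trained teacher provides pseudo-clean targets.
        pseudo_target = teacher(noisy_batch)
    estimate = student(noisy_batch)
    loss = F.l1_loss(estimate, pseudo_target)   # assumed loss choice
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```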