Teacher-Student MixIT for Unsupervised and Semi-Supervised Speech Separation

Zhang, Jisi; Zorilă, Cătălin; Doddipatla, Rama; Barker, Jon

doi:10.21437/interspeech.2021-1243

Cited by 12 publications

(2 citation statements)

References 26 publications

“…As the model is trained to separate the MOMs into a variable number of latent sources, the separated sources can be remixed to approximate the original mixtures. Motivated by MixIT, the authors in [31] proposed a teacher-student MixIT (TS-MixIT) to alleviate the over-separation problem in the original MixIT. It takes the unsupervised model trained by MixIT as a teacher model; the estimated sources are then filtered and selected as pseudo-targets to further train a student model using standard permutation invariant training (PIT) [3].…”
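The student-training step described in this statement relies on permutation invariant training. As a rough illustration only, the sketch below scores every assignment of student outputs to teacher pseudo-targets and keeps the cheapest one. The function names `mse` and `pit_loss` are hypothetical, and mean squared error is an assumed stand-in for whatever signal-level loss the cited papers actually use.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length waveforms (lists of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates, targets):
    """Return (lowest mean loss, best permutation).

    best_permutation[i] is the index of the target assigned to estimate i.
    MSE is an assumption made for this sketch, not the loss from the paper.
    """
    n = len(estimates)
    best_loss, best_perm = float("inf"), None
    # Try every one-to-one assignment of estimates to targets.
    for perm in permutations(range(n)):
        loss = sum(mse(estimates[i], targets[p]) for i, p in enumerate(perm)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Student outputs arrive in the opposite order to the teacher pseudo-targets.
student_out = [[1.0, 2.0], [0.0, 0.0]]
pseudo_targets = [[0.0, 0.0], [1.0, 2.0]]
loss, perm = pit_loss(student_out, pseudo_targets)
```

The exhaustive search grows factorially with the number of sources, which is acceptable for the two or three sources typical of speech separation but would need a different assignment method for larger counts.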

Section: Introduction

mentioning

confidence: 99%

Heterogeneous Separation Consistency Training for Adaptation of Unsupervised Speech Separation

Han, Long

2022

Preprint

Recently, supervised speech separation has made great progress. However, limited by the nature of supervised training, most existing separation methods require ground-truth sources and are trained on synthetic datasets. This ground-truth reliance is problematic because the ground-truth signals are usually unavailable in real conditions. Moreover, in many industry scenarios, the real acoustic characteristics deviate far from the ones in simulated datasets. Therefore, the performance usually degrades significantly when applying the supervised speech separation models to real applications. To address these problems, in this study, we propose a novel separation consistency training, termed SCT, to exploit the real-world unlabeled mixtures for improving cross-domain unsupervised speech separation in an iterative manner, by leveraging the complementary information obtained from heterogeneous (structurally distinct but behaviorally complementary) models. SCT follows a framework using two heterogeneous neural networks (HNNs) to produce high-confidence pseudo labels of unlabeled real speech mixtures. These labels are then updated and used to refine the HNNs to produce more reliable, consistent separation results for real mixture pseudo-labeling. To maximally utilize the large complementary information between different separation networks, a cross-knowledge adaptation is further proposed. Together with the simulated dataset, those real mixtures with high-confidence pseudo labels are then used to update the HNN separation models iteratively. In addition, we find that combining the heterogeneous separation outputs by a simple linear fusion can slightly improve the final system performance further. The proposed SCT is evaluated on both public reverberant English and anechoic Mandarin cross-domain separation tasks. Results show that, without any available ground-truth of target domain mixtures, the SCT can still significantly outperform our two strong baselines with up to 1.61 dB and 3.44 dB performance improvements, on the English and Mandarin cross-domain conditions respectively.

“…Mixup [32]) but have also been successfully applied to several audio tasks [33], [34]. In [35], a student model with a smaller number of estimated sources has been trained on a subset of outputs of a pre-trained MixIT model to solve the input SNR distribution mismatch. Furthermore, a student model could also perform test-time adaptation by using the teacher's estimated waveforms as targets [36].…”

mentioning

confidence: 99%

RemixIT: Continual Self-Training of Speech Enhancement Models via Bootstrapped Remixing

Tzinis, Adi, Ithapu, et al. 2022

IEEE J. Sel. Top. Signal Process.

We present RemixIT, a simple yet effective self-supervised method for training speech enhancement without the need of a single isolated in-domain speech nor a noise waveform. Our approach overcomes limitations of previous methods which make them dependent on clean in-domain target signals and thus, sensitive to any domain mismatch between train and test samples. RemixIT is based on a continuous self-training scheme in which a pre-trained teacher model on out-of-domain data infers estimated pseudo-target signals for in-domain mixtures. Then, by permuting the estimated clean and noise signals and remixing them together, we generate a new set of bootstrapped mixtures and corresponding pseudo-targets which are used to train the student network. Vice versa, the teacher periodically refines its estimates using the updated parameters of the latest student models. Experimental results on multiple speech enhancement datasets and tasks not only show the superiority of our method over prior approaches but also showcase that RemixIT can be combined with any separation model as well as be applied towards any semi-supervised and unsupervised domain adaptation task. Our analysis, paired with empirical evidence, sheds light on the inside functioning of our self-training scheme wherein the student model keeps obtaining better performance while observing severely degraded pseudo-targets.
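The bootstrapped remixing step in this abstract can be pictured with a small sketch. It assumes the teacher has already produced one speech estimate and one noise estimate per mixture in a batch, and that only the noise estimates are shuffled across the batch before remixing; the abstract does not spell out these details, so both are assumptions. The function name `bootstrap_remix` is hypothetical.

```python
import random


def bootstrap_remix(speech_estimates, noise_estimates, rng):
    """Build bootstrapped mixtures and pseudo-targets from teacher estimates.

    Assumption for this sketch: only the noise estimates are permuted
    across the batch; each speech estimate stays in its original slot.
    """
    order = list(range(len(noise_estimates)))
    rng.shuffle(order)  # random permutation of batch indices
    shuffled_noise = [noise_estimates[j] for j in order]
    # New mixture = teacher speech estimate + a noise estimate from another item.
    mixtures = [
        [s + n for s, n in zip(speech, noise)]
        for speech, noise in zip(speech_estimates, shuffled_noise)
    ]
    # The student is trained to recover these pairs from the new mixtures.
    pseudo_targets = list(zip(speech_estimates, shuffled_noise))
    return mixtures, pseudo_targets


# Toy batch of three two-sample "waveforms".
speech = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
noise = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
mixtures, pseudo_targets = bootstrap_remix(speech, noise, random.Random(0))
```

Because each bootstrapped mixture is an exact sum of its pseudo-targets, the student sees self-consistent training pairs even when the teacher's estimates are imperfect.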

Deep neural network techniques for monaural speech enhancement and separation: state of the art analysis

Ochieng

2023

Artif Intell Rev

Deep neural network (DNN) techniques have become pervasive in domains such as natural language processing and computer vision. They have achieved great success in tasks such as machine translation and image generation. Due to their success, these data-driven techniques have been applied in the audio domain. More specifically, DNN models have been applied in speech enhancement and separation to perform speech denoising, dereverberation, speaker extraction and speaker separation. In this paper, we review the current DNN techniques being employed to achieve speech enhancement and separation. The review looks at the whole pipeline of speech enhancement and separation techniques, from feature extraction, how DNN-based tools model both global and local features of speech, and model training (supervised and unsupervised) to how they address the label ambiguity problem. The review also covers the use of domain adaptation techniques and pre-trained models to boost the speech enhancement process. By this, we hope to provide an all-inclusive reference of the state-of-the-art DNN-based techniques being applied in the domain of speech separation and enhancement. We further discuss future research directions. This survey can be used by both academic researchers and industry practitioners working in the speech separation and enhancement domain.
