Adapting Speech Separation to Real-World Meetings Using Mixture Invariant Training

Sivaraman, Aswin; Wisdom, Scott; Erdoğan, Hakan; Hershey, John R.

doi:10.48550/arxiv.2110.10739

Cited by 1 publication

(1 citation statement)

References 17 publications


“…To this end, some latest works start to separate speech from unsupervised or semi-supervised perspectives. In [28]-[30], a mixture invariant training (MixIT) that requires only single-channel real acoustic mixtures was proposed. MixIT uses mixtures of mixtures (MoMs) as input, and sums over estimated sources to match the target mixtures instead of the single-source references.…”

Section: Introduction (mentioning)

confidence: 99%
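The MixIT objective described in the statement above can be illustrated with a minimal sketch. It assumes two input mixtures, a brute-force search over every way of assigning estimated sources to mixtures, and a plain mean-squared-error loss swapped in for simplicity; the function name `mixit_loss` and the array shapes are hypothetical and not taken from the cited papers.

```python
import itertools

import numpy as np


def mixit_loss(mixtures, est_sources):
    """Illustrative MixIT-style loss (assumed MSE, brute-force search).

    mixtures:    array of shape (n_mix, T), the reference mixtures.
    est_sources: array of shape (n_src, T), sources estimated from the
                 sum of the mixtures (the mixture of mixtures).
    Returns the lowest loss and the assignment that achieves it, where
    assignment[m] is the index of the mixture that source m is added to.
    """
    n_mix = mixtures.shape[0]
    n_src = est_sources.shape[0]
    best_loss, best_assign = np.inf, None
    # Each estimated source is assigned to exactly one mixture.
    for assign in itertools.product(range(n_mix), repeat=n_src):
        mixing = np.zeros((n_mix, n_src))
        mixing[list(assign), np.arange(n_src)] = 1.0
        remixed = mixing @ est_sources  # sum the sources per mixture
        loss = np.mean((mixtures - remixed) ** 2)
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


# Toy check: if the "estimates" are the true sources, the best
# assignment recovers which sources formed each mixture.
rng = np.random.default_rng(0)
sources = rng.standard_normal((4, 100))
mixtures = np.stack([sources[0] + sources[1], sources[2] + sources[3]])
loss, assignment = mixit_loss(mixtures, sources)
```

No single-source reference enters the loss: only the two mixtures are compared against remixed estimates, which is why the approach can be trained on real recordings. The exhaustive search grows exponentially with the number of estimated sources, so this sketch is only practical for small source counts.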

Heterogeneous Separation Consistency Training for Adaptation of Unsupervised Speech Separation

Han, Long

2022

Preprint


Recently, supervised speech separation has made great progress. However, limited by the nature of supervised training, most existing separation methods require ground-truth sources and are trained on synthetic datasets. This ground-truth reliance is problematic, because ground-truth signals are usually unavailable in real conditions. Moreover, in many industry scenarios, the real acoustic characteristics deviate far from those in simulated datasets. Therefore, performance usually degrades significantly when supervised speech separation models are applied to real applications. To address these problems, in this study we propose a novel separation consistency training, termed SCT, which exploits real-world unlabeled mixtures to improve cross-domain unsupervised speech separation in an iterative manner by leveraging the complementary information obtained from heterogeneous (structurally distinct but behaviorally complementary) models. SCT follows a framework that uses two heterogeneous neural networks (HNNs) to produce high-confidence pseudo labels for unlabeled real speech mixtures. These labels are then updated and used to refine the HNNs so that they produce more reliable, consistent separation results for real-mixture pseudo-labeling. To make maximal use of the large amount of complementary information between different separation networks, a cross-knowledge adaptation is further proposed. Together with the simulated dataset, the real mixtures with high-confidence pseudo labels are then used to update the HNN separation models iteratively. In addition, we find that combining the heterogeneous separation outputs by a simple linear fusion slightly improves the final system performance further. The proposed SCT is evaluated on both public reverberant English and anechoic Mandarin cross-domain separation tasks. Results show that, without any ground truth for the target-domain mixtures, SCT still significantly outperforms our two strong baselines, with performance improvements of up to 1.61 dB and 3.44 dB on the English and Mandarin cross-domain conditions, respectively.
