ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747855
Adapting Speech Separation to Real-World Meetings using Mixture Invariant Training

Cited by 11 publications (4 citation statements). References 14 publications.
“…(ii) show that exploiting out-of-domain clean speech as well as in-domain real noisy data during the training of the enhancement network yields significant recognition gains for real test samples; (iii) use the MixIT framework for both types of data (instead of switching to supervised training when using out-of-domain clean speech as described in [9]) by modifying the remixing matrix A such that the reference speech can be reconstructed by the first output channel alone and the non-speech signal is reconstructed by a sum of channels 2 and 3; (iv) exploit speaker reinforcement post-processing to mask processing artifacts and further improve ASR accuracy.…”
Section: Main Contributions
Mentioning confidence: 99%
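Contribution (iii) above constrains the MixIT remixing matrix A so that the first output channel alone reconstructs the reference (clean) speech while channels 2 and 3 together reconstruct the non-speech signal. The sketch below illustrates that idea in PyTorch under stated assumptions; the function and variable names (mixit_loss, neg_snr, est_sources) are illustrative and not taken from the cited work.

```python
import itertools
import torch


def neg_snr(est, ref, eps=1e-8):
    """Negative SNR between an estimate and a reference, per (batch, channel)."""
    noise = est - ref
    snr = 10 * torch.log10(ref.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -snr


def mixit_loss(est_sources, mix1, mix2, constrained=False):
    """MixIT loss over two input mixtures.

    est_sources: (batch, num_src, time) separator outputs.
    mix1, mix2:  (batch, time) input mixtures; with constrained=True, mix1 is the
                 out-of-domain reference speech and mix2 the non-speech signal.
    """
    batch, num_src, _ = est_sources.shape
    mixes = torch.stack([mix1, mix2], dim=1)  # (batch, 2, time)

    if constrained:
        # Fixed remixing matrix A: reference speech <- channel 1 alone,
        # non-speech <- channels 2 + 3 (assumes num_src == 3).
        candidates = [torch.tensor([[1.0, 0.0, 0.0],
                                    [0.0, 1.0, 1.0]])]
    else:
        # Standard MixIT: search every binary assignment of sources to mixtures.
        candidates = []
        for bits in itertools.product([0, 1], repeat=num_src):
            a = torch.zeros(2, num_src)
            a[list(bits), list(range(num_src))] = 1.0
            candidates.append(a)

    # Remix the estimated sources with each candidate A and keep the best one.
    per_candidate = []
    for a in candidates:
        remixed = torch.einsum("cm,bmt->bct", a, est_sources)
        per_candidate.append(neg_snr(remixed, mixes).mean(dim=1))  # (batch,)
    return torch.stack(per_candidate).min(dim=0).values.mean()
```

With constrained=False this reduces to the usual MixIT search over all binary assignments; the constrained branch applies the single fixed A described in the quotation.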
“…In real situations, where paired noisy and clean signals are not available, we may instead look to use unpaired noisy and clean speech data. Several training strategies have been developed for using such data based on adversarial learning [7,8] and transfer learning [9][10][11]. For the adversarial training, discriminator networks are used to distinguish the enhanced and noised features from the clean and noisy ones, respectively [7].…”
Section: Introduction
Mentioning confidence: 99%
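The adversarial strategy referenced as [7] trains discriminator networks to separate real clean (or noisy) features from generated enhanced (or noised) ones, while the enhancement network learns to fool them. The sketch below shows only the clean-versus-enhanced half of that setup and is an assumption-laden illustration: the feature dimension, network sizes, and function names are hypothetical, not taken from the cited papers.

```python
import torch
import torch.nn as nn

FEATURE_DIM = 257  # e.g. magnitude-spectrum bins; an assumed value

# Discriminator that scores a feature frame as "clean-like" (logit output).
disc = nn.Sequential(nn.Linear(FEATURE_DIM, 128), nn.LeakyReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()


def discriminator_loss(clean_feats, enhanced_feats):
    """Train the discriminator: clean features -> 1, enhanced features -> 0."""
    real = bce(disc(clean_feats), torch.ones(clean_feats.size(0), 1))
    fake = bce(disc(enhanced_feats.detach()), torch.zeros(enhanced_feats.size(0), 1))
    return real + fake


def enhancer_adversarial_loss(enhanced_feats):
    """Train the enhancer to make its outputs look clean to the discriminator."""
    return bce(disc(enhanced_feats), torch.ones(enhanced_feats.size(0), 1))
```

A symmetric pair of losses for the noised-versus-noisy features would complete the two-discriminator setup the quotation describes.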