Interspeech 2017
DOI: 10.21437/interspeech.2017-305

Recognizing Multi-Talker Speech with Permutation Invariant Training

Abstract: In this paper, we propose a novel technique for direct recognition of multiple speech streams given the single channel of mixed speech, without first separating them. Our technique is based on permutation invariant training (PIT) for automatic speech recognition (ASR). In PIT-ASR, we compute the average cross entropy (CE) over all frames in the whole utterance for each possible output-target assignment, pick the one with the minimum CE, and optimize for that assignment. PIT-ASR forces all the frames of the sam…
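The assignment step described in the abstract is the core of PIT-ASR, so a small sketch may help: for each permutation of output streams against target streams, average the frame-level cross entropy over the whole utterance, then optimize the minimum. The sketch below is an illustration only; the tensor shapes, names, and the choice of PyTorch are assumptions, not the authors' implementation.

```python
# Illustrative sketch of utterance-level PIT cross entropy (not the paper's code).
# logits:  (S, T, C) -- one output stream per speaker, T frames, C classes.
# targets: (S, T)    -- frame-level label indices for each reference speaker.
import itertools
import torch
import torch.nn.functional as F

def pit_cross_entropy(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Minimum over all output-target assignments of the average frame CE."""
    num_streams = logits.shape[0]
    best = None
    for perm in itertools.permutations(range(num_streams)):
        # Average CE over all frames of the whole utterance for this assignment.
        ce = sum(
            F.cross_entropy(logits[out], targets[tgt])
            for out, tgt in enumerate(perm)
        ) / num_streams
        best = ce if best is None else torch.minimum(best, ce)
    return best  # optimizing this keeps each output stream tied to one speaker
```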

Cited by 81 publications (77 citation statements). References 22 publications.

Citation statements:
“…The signal-to-noise ratio (SNR) of one source against the other was randomly chosen from a uniform distribution in the range of [−5, 5] dB. The validation and evaluation sets were generated in a similar way by selecting source utterances from the WSJ Dev93 and Eval92 sets respectively, and the durations are 1.3 h and 0.8 h. We then create a new spatialized version of the wsj1-2mix dataset following the process applied to the wsj0-2mix dataset in [17], using a room impulse response (RIR) generator (available online at https://github.com/ehabets/RIR-Generator), where the characteristics of each two-speaker mixture …

Algorithm 1: Curriculum learning strategy
1. Load the training dataset X;
2. Categorize the training data X into single-channel single-speaker data Xclean and multi-channel multi-speaker data Xnoisy;
3. Sort the single-channel single-speaker training data in Xclean in ascending order of the utterance lengths, leading to X′clean;
4. Sort the multi-channel multi-speaker training data in Xnoisy in ascending order of the SNR level, leading to X′noisy;
5. Divide X′clean and X′noisy into minibatch sets Bclean and Bnoisy;
6. Sort batches to alternate between batches from Bclean and Bnoisy;
7. while model is not converged do
8.   for each b in all minibatches do
9.     Feed minibatch b into the model, update the model;

To train the model, we used the spatialized wsj1-2mix data with J = 2 speakers as well as the train_si284 training set from the WSJ1 dataset to regularize the training procedure. All input data are raw waveform audio signals.…”
Section: Methods
Mentioning confidence: 99%
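The quoted curriculum strategy translates directly into a data-scheduling loop. Below is a minimal sketch of Algorithm 1 under assumed interfaces: utterances are dicts with "length" and "snr" keys, and `train_step`/`converged` are hypothetical callables; none of these names come from the cited paper.

```python
# Sketch of the quoted curriculum learning strategy (Algorithm 1); all names
# here ("length", "snr", train_step, converged) are illustrative assumptions.
from itertools import zip_longest

def make_batches(items, batch_size):
    # Step 5: split a sorted list into consecutive minibatches.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def curriculum_schedule(clean, noisy, batch_size):
    # Steps 3-4: sort clean data by utterance length and noisy data by SNR,
    # both in ascending order, as the quote specifies.
    b_clean = make_batches(sorted(clean, key=lambda u: u["length"]), batch_size)
    b_noisy = make_batches(sorted(noisy, key=lambda u: u["snr"]), batch_size)
    # Step 6: alternate between clean and noisy minibatches.
    schedule = []
    for bc, bn in zip_longest(b_clean, b_noisy):
        schedule.extend(b for b in (bc, bn) if b is not None)
    return schedule

def train(model, clean, noisy, batch_size, train_step, converged):
    # Steps 7-9: iterate over the alternating schedule until convergence.
    while not converged(model):
        for batch in curriculum_schedule(clean, noisy, batch_size):
            train_step(model, batch)
```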
“…estimate a mask for every speaker with a permutation-free objective function that minimizes the reconstruction loss. PIT was later applied to multi-speaker automatic speech recognition (ASR) by directly optimizing a speech recognition loss [8,9] within a DNN-HMM hybrid ASR framework. In recent years, end-to-end models have drawn a lot of attention in single-speaker ASR systems and shown great success [10][11][12][13].…”
Section: Introduction
Mentioning confidence: 99%
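The permutation-free objective mentioned in this excerpt is the separation-side counterpart of the CE criterion in the abstract: compute the reconstruction loss under every speaker permutation and optimize the minimum. A minimal sketch, again with assumed shapes and PyTorch as an illustrative choice:

```python
# Sketch of permutation-invariant mask training for separation (illustrative).
# masks:   (S, T, F) estimated masks per speaker stream.
# mixture: (T, F)    mixed magnitude spectrogram.
# sources: (S, T, F) reference source spectrograms. Shapes are assumptions.
import itertools
import torch

def pit_reconstruction_loss(masks, mixture, sources):
    estimates = masks * mixture  # mixture broadcasts over the S mask streams
    losses = [
        torch.mean((estimates - sources[list(perm)]) ** 2)
        for perm in itertools.permutations(range(sources.shape[0]))
    ]
    # Minimize the reconstruction error under the best speaker assignment.
    return torch.min(torch.stack(losses))
```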
“…Based on these source separation techniques, multi-speaker ASR systems have been constructed. DPCL and PIT have been used as frequency-domain source separation front-ends for a state-of-the-art single-speaker ASR system and extended to jointly trained E2E or hybrid systems [7,8,9,10]. They showed that joint (re-)training can improve the performance of these models over a simple cascade system.…”
Section: Introduction
Mentioning confidence: 99%
“…For such overlapped speech, neither conventional ASR nor speaker diarization provides a result with sufficient accuracy. It is known that mixing two speech signals significantly degrades ASR accuracy [4][5][6]. In addition, no speaker overlap is assumed by most conventional speaker diarization techniques, such as clustering of speech partitions (e.g.…”
Section: Introduction
Mentioning confidence: 99%