Interspeech 2017
DOI: 10.21437/interspeech.2017-305

Recognizing Multi-Talker Speech with Permutation Invariant Training

Abstract: In this paper, we propose a novel technique for direct recognition of multiple speech streams given the single channel of mixed speech, without first separating them. Our technique is based on permutation invariant training (PIT) for automatic speech recognition (ASR). In PIT-ASR, we compute the average cross entropy (CE) over all frames in the whole utterance for each possible output-target assignment, pick the one with the minimum CE, and optimize for that assignment. PIT-ASR forces all the frames of the sam…
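The assignment step described in the abstract is the core of PIT-ASR, so a small sketch may help: for each permutation of output streams against target streams, average the frame-level cross entropy over the whole utterance, then optimize the minimum. The sketch below is an illustration only; the tensor shapes, names, and the choice of PyTorch are assumptions, not the authors' implementation.

```python
# Illustrative sketch of utterance-level PIT cross entropy (not the paper's code).
# logits:  (S, T, C) -- one output stream per speaker, T frames, C classes.
# targets: (S, T)    -- frame-level label indices for each reference speaker.
import itertools
import torch
import torch.nn.functional as F

def pit_cross_entropy(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Minimum over all output-target assignments of the average frame CE."""
    num_streams = logits.shape[0]
    best = None
    for perm in itertools.permutations(range(num_streams)):
        # Average CE over all frames of the whole utterance for this assignment.
        ce = sum(
            F.cross_entropy(logits[out], targets[tgt])
            for out, tgt in enumerate(perm)
        ) / num_streams
        best = ce if best is None else torch.minimum(best, ce)
    return best  # optimizing this keeps each output stream tied to one speaker
```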

Cited by 81 publications (77 citation statements). References 22 publications.

Citation statements:
“…The signal-to-noise ratio (SNR) of one source against the other was randomly chosen from a uniform distribution in the range of [−5, 5] dB. The validation and evaluation sets were generated in a similar way by selecting source utterances from the WSJ Dev93 and Eval92 sets respectively, and the durations are 1.3 h and 0.8 h. We then create a new spatialized version of the wsj1-2mix dataset following the process applied to the wsj0-2mix dataset in [17], using a room impulse response (RIR) generator (available online at https://github.com/ehabets/RIR-Generator), where the characteristics of each two-speaker mixture …

Algorithm 1: Curriculum learning strategy
1. Load the training dataset X;
2. Categorize the training data X into single-channel single-speaker data Xclean and multi-channel multi-speaker data Xnoisy;
3. Sort the single-channel single-speaker training data in Xclean in ascending order of the utterance lengths, leading to X′clean;
4. Sort the multi-channel multi-speaker training data in Xnoisy in ascending order of the SNR level, leading to X′noisy;
5. Divide X′clean and X′noisy into minibatch sets Bclean and Bnoisy;
6. Sort batches to alternate between batches from Bclean and Bnoisy;
7. while model is not converged do
8.   for each b in all minibatches do
9.     Feed minibatch b into the model, update the model;

To train the model, we used the spatialized wsj1-2mix data with J = 2 speakers as well as the train_si284 training set from the WSJ1 dataset to regularize the training procedure. All input data are raw waveform audio signals.…”
Section: Methods
Mentioning confidence: 99%
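The quoted curriculum strategy translates directly into a data-scheduling loop. Below is a minimal sketch of Algorithm 1 under assumed interfaces: utterances are dicts with "length" and "snr" keys, and `train_step`/`converged` are hypothetical callables; none of these names come from the cited paper.

```python
# Sketch of the quoted curriculum learning strategy (Algorithm 1); all names
# here ("length", "snr", train_step, converged) are illustrative assumptions.
from itertools import zip_longest

def make_batches(items, batch_size):
    # Step 5: split a sorted list into consecutive minibatches.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def curriculum_schedule(clean, noisy, batch_size):
    # Steps 3-4: sort clean data by utterance length and noisy data by SNR,
    # both in ascending order, as the quote specifies.
    b_clean = make_batches(sorted(clean, key=lambda u: u["length"]), batch_size)
    b_noisy = make_batches(sorted(noisy, key=lambda u: u["snr"]), batch_size)
    # Step 6: alternate between clean and noisy minibatches.
    schedule = []
    for bc, bn in zip_longest(b_clean, b_noisy):
        schedule.extend(b for b in (bc, bn) if b is not None)
    return schedule

def train(model, clean, noisy, batch_size, train_step, converged):
    # Steps 7-9: iterate over the alternating schedule until convergence.
    while not converged(model):
        for batch in curriculum_schedule(clean, noisy, batch_size):
            train_step(model, batch)
```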
“…estimate a mask for every speaker with a permutation-free objective function that minimizes the reconstruction loss. PIT was later applied to multi-speaker automatic speech recognition (ASR) by directly optimizing a speech recognition loss [8,9] within a DNN-HMM hybrid ASR framework. In recent years, end-to-end models have drawn a lot of attention in single-speaker ASR systems and shown great success [10][11][12][13].…”
Section: Introduction
Mentioning confidence: 99%
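The permutation-free objective mentioned in this excerpt is the separation-side counterpart of the CE criterion in the abstract: compute the reconstruction loss under every speaker permutation and optimize the minimum. A minimal sketch, again with assumed shapes and PyTorch as an illustrative choice:

```python
# Sketch of permutation-invariant mask training for separation (illustrative).
# masks:   (S, T, F) estimated masks per speaker stream.
# mixture: (T, F)    mixed magnitude spectrogram.
# sources: (S, T, F) reference source spectrograms. Shapes are assumptions.
import itertools
import torch

def pit_reconstruction_loss(masks, mixture, sources):
    estimates = masks * mixture  # mixture broadcasts over the S mask streams
    losses = [
        torch.mean((estimates - sources[list(perm)]) ** 2)
        for perm in itertools.permutations(range(sources.shape[0]))
    ]
    # Minimize the reconstruction error under the best speaker assignment.
    return torch.min(torch.stack(losses))
```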
“…Based on these source separation techniques, multi-speaker ASR systems have been constructed. DPCL and PIT have been used as frequency-domain source separation front-ends for a state-of-the-art single-speaker ASR system and extended to jointly trained E2E or hybrid systems [7,8,9,10]. They showed that joint (re-)training can improve the performance of these models over a simple cascade system.…”
Section: Introduction
Mentioning confidence: 99%
“…For such overlapped speech, neither conventional ASR nor speaker diarization provides a result with sufficient accuracy. It is known that mixing two speech signals significantly degrades ASR accuracy [4][5][6]. In addition, no speaker overlap is assumed by most conventional speaker diarization techniques, such as clustering of speech partitions (e.g.…”
Section: Introduction
Mentioning confidence: 99%