2018
DOI: 10.1016/j.specom.2018.09.003
Single-channel multi-talker speech recognition with permutation invariant training

Abstract: Although great progress has been made in automatic speech recognition (ASR), significant performance degradation is still observed when recognizing multi-talker mixed speech. In this paper, we propose and evaluate several architectures to address this problem under the assumption that only a single channel of the mixed signal is available. Our technique extends permutation invariant training (PIT) by introducing a frontend feature separation module with the minimum mean square error (MSE) criterion and the ba…
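As a rough illustration of the kind of front-end the abstract describes, the sketch below shows a masking-based feature-separation module that maps mixed-speech features to one estimate per speaker. The class name, BLSTM sizes, and sigmoid masks are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MaskingSeparationFrontend(nn.Module):
    """Hypothetical masking-based feature-separation front-end (a sketch).

    A BLSTM stack reads the mixed-speech features and emits one mask per
    speaker; the masked mixture features are the per-speaker estimates that
    a PIT-style MSE criterion would compare against the clean features.
    """

    def __init__(self, feat_dim=80, hidden=512, layers=3, num_speakers=2):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                             batch_first=True, bidirectional=True)
        self.mask = nn.Linear(2 * hidden, num_speakers * feat_dim)
        self.num_speakers = num_speakers
        self.feat_dim = feat_dim

    def forward(self, mix):                  # mix: (batch, time, feat_dim)
        h, _ = self.blstm(mix)
        m = torch.sigmoid(self.mask(h))      # one mask per speaker
        m = m.view(mix.size(0), mix.size(1), self.num_speakers, self.feat_dim)
        return m * mix.unsqueeze(2)          # (batch, time, speakers, feat_dim)
```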

Cited by 65 publications (53 citation statements); references 39 publications.
“…Although multi-speaker source separation can already be performed by combining independently trained front- and back-end systems, the source separator produces artifacts unknown to the ASR system which disturb its performance. According to [19], and as also shown in [8,10], such a mismatch can be mitigated by jointly fine-tuning the whole model at once. We here compare three different variants of joint fine-tuning: (a) fine-tuning just the ASR system on the enhanced signals, (b) fine-tuning just the front-end by propagating gradients through the ASR system but only updating the front-end parameters and (c) jointly fine-tuning both systems.…”
Section: Joint End-to-end Multi-speaker ASR (mentioning)
confidence: 99%
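The three variants listed in the excerpt above amount to choosing which stage's parameters receive updates while gradients still flow end to end. The helper below is a hypothetical sketch of that setup for two arbitrary PyTorch modules; the function name and optimizer settings are assumptions, not taken from the cited papers.

```python
import itertools
import torch

def configure_finetuning(frontend, asr, variant):
    """Select trainable parameters for variants (a), (b), or (c).

    variant: "asr_only"      -> (a) update only the ASR back-end,
             "frontend_only" -> (b) back-propagate through the ASR model but
                                 update only the front-end parameters,
             "joint"         -> (c) update both stages jointly.
    Freezing a stage's parameters does not block gradient flow through it,
    so variant (b) still trains the front-end with ASR-derived gradients.
    """
    for p in frontend.parameters():
        p.requires_grad = variant in ("frontend_only", "joint")
    for p in asr.parameters():
        p.requires_grad = variant in ("asr_only", "joint")
    params = itertools.chain(frontend.parameters(), asr.parameters())
    return torch.optim.Adam((p for p in params if p.requires_grad), lr=1e-4)
```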
“…Other works have already studied the effectiveness of frequency-domain source separation techniques as a front-end for ASR. DPCL and PIT have been used effectively for this purpose, and it was shown that joint retraining for fine-tuning can improve performance [7,8,10]. E2E systems for single-channel multi-speaker ASR have been proposed that no longer consist of individual parts dedicated to source separation and speech recognition, but combine these functionalities into one large monolithic neural network.…”
Section: Relation To Prior Work (mentioning)
confidence: 99%
“…BLSTM is often used in uPIT-based speech separation systems for its capacity to model long time dependencies in the forward and backward directions [8], [13], [12], [14], [15], [16], [17], [18]. However, BLSTM has a high latency, as long as the utterance itself.…”
Section: CSC-BLSTM and LC-BLSTM (mentioning)
confidence: 99%
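Latency-controlled BLSTM trades the full-utterance look-ahead of a standard BLSTM for a bounded one. The layer below is a minimal sketch of that idea under assumed chunk and right-context sizes: the forward LSTM carries its state across chunks, while the backward LSTM only sees a limited number of future frames; it is illustrative rather than the cited systems' exact implementation.

```python
import torch
import torch.nn as nn

class LCBLSTMLayer(nn.Module):
    """Sketch of a latency-controlled BLSTM layer."""

    def __init__(self, input_size, hidden_size, chunk_size=40, right_context=20):
        super().__init__()
        self.fwd = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.bwd = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.chunk_size = chunk_size
        self.right_context = right_context

    def forward(self, x):                        # x: (batch, time, feat)
        T = x.size(1)
        fwd_state, outputs = None, []
        for start in range(0, T, self.chunk_size):
            end = min(start + self.chunk_size, T)
            ctx_end = min(end + self.right_context, T)
            # Forward direction: state is carried across chunk boundaries.
            fwd_out, fwd_state = self.fwd(x[:, start:end], fwd_state)
            # Backward direction: restarted every chunk, limited look-ahead,
            # keeping only the outputs that belong to the current chunk.
            rev = torch.flip(x[:, start:ctx_end], dims=[1])
            bwd_out, _ = self.bwd(rev)
            bwd_out = torch.flip(bwd_out, dims=[1])[:, :end - start]
            outputs.append(torch.cat([fwd_out, bwd_out], dim=-1))
        return torch.cat(outputs, dim=1)         # latency ≈ chunk + right context
```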
“…It is extended to utterance-level PIT (uPIT) [8] with an utterance-level cost function to further improve performance. Because uPIT is simple and performs well, it has drawn a lot of attention [6], [9], [12], [13], [14], [15], [16], [17], [18], [19]. LSTM [20], [21], [22] and BLSTM [23], [24] are widely used for uPIT to exploit utterance-level long time dependencies.…”
mentioning
confidence: 99%
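What distinguishes uPIT from frame-level PIT is that the speaker permutation is fixed once for the whole utterance rather than chosen frame by frame. A minimal sketch of that cost function, assuming an MSE separation criterion and feature-domain targets, could look as follows; the function name and tensor layout are illustrative assumptions.

```python
import itertools
import torch

def upit_mse_loss(estimates, targets):
    """Utterance-level PIT loss (sketch).

    estimates, targets: tensors of shape (batch, time, speakers, feat).
    The MSE is accumulated over the whole utterance for every speaker
    permutation, and one best permutation is picked per utterance.
    """
    num_speakers = estimates.size(2)
    per_perm = []
    for perm in itertools.permutations(range(num_speakers)):
        # Mean squared error over time and features for this assignment.
        err = ((estimates[:, :, list(perm), :] - targets) ** 2).mean(dim=(1, 2, 3))
        per_perm.append(err)                       # (batch,)
    per_perm = torch.stack(per_perm, dim=1)        # (batch, num_permutations)
    return per_perm.min(dim=1).values.mean()       # best assignment per utterance
```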