ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054029

End-To-End Multi-Speaker Speech Recognition With Transformer

Abstract: Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios. In this work, we explore the use of Transformer models for these tasks by focusing on two aspects. First, we replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. Second, in order to use the Transformer in the masking network of the neural beamformer in the multi-channe…
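The abstract's first contribution is swapping the RNN-based encoder-decoder for a Transformer. The core operation that replaces recurrence is scaled dot-product attention; a minimal single-head numpy sketch (illustrative only, not the paper's implementation; all names here are assumptions) looks like:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Toy single-head attention: queries Q attend over keys K to mix values V.

    Q: (T_q, d), K: (T_k, d), V: (T_k, d_v) -- one head, no batching.
    Returns the context vectors and the attention weights.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (T_q, T_k) scaled similarities
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # softmax over the key axis
    return A @ V, A                                # (T_q, d_v), (T_q, T_k)
```

In a real encoder this is wrapped in multi-head projections, residual connections, and position encodings; the sketch only shows why no recurrent state is needed, since every output frame attends to all input frames at once.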

Cited by 76 publications (51 citation statements)
References 30 publications
“…The third mainstream method is the joint training method [21], [26]-[28]. These methods apply a joint training framework to optimize speech enhancement and recognition simultaneously.…”
Section: Introduction
confidence: 99%
“…One approach to make these devices robust against noise is to equip them with multiple microphones so that the spectral and spatial diversity of the target and interference signals can be leveraged using beamforming approaches [1][2][3][4][5][6]. It has been demonstrated in [4,6,7] that beamforming methods for multi-channel speech enhancement produce substantial improvements for ASR systems; therefore, existing ASR pipelines are mainly built on beamforming as a pre-processor and then cascaded with an acoustic-to-text model [2,[8][9][10].…”
Section: Introduction
confidence: 99%
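The excerpt above describes the common pipeline of beamforming as a pre-processor for ASR. The simplest form is a frequency-domain filter-and-sum beamformer; a generic numpy sketch (function and variable names are illustrative assumptions, not from the cited works) is:

```python
import numpy as np

def filter_and_sum(stft, w):
    """Frequency-domain filter-and-sum beamformer.

    stft: (C, F, T) complex multi-channel spectrogram (C mics, F bins, T frames)
    w:    (C, F)    per-channel, per-frequency complex filter weights

    Returns the enhanced single-channel spectrogram
    y(f, t) = sum_c conj(w[c, f]) * x[c, f, t].
    """
    return np.einsum('cf,cft->ft', w.conj(), stft)
```

In the classical stand-alone setting, w comes from a fixed design such as delay-and-sum steering; the neural variants discussed next differ mainly in how w is estimated (directly by a network, or via masks and spatial statistics).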
“…The neural Filter&Sum approaches directly estimate the beamforming filter parameters in either the time domain [16]-[18] or the frequency domain [19] to produce the separated outputs. The mask-based MVDR [4]-[6], [20]-[23] and related mask-based GEV [24], [25] approaches predict the TF masks using DNNs before estimating the power spectral density (PSD) matrices for the target and overlapping speakers to obtain the beamforming filter parameters. Compared with the conventional stand-alone beamforming approaches, these neural-based methods allow a tighter integration with the downstream recognition back-end [5], [6], [19], [25], [26].…”
Section: Introduction
confidence: 99%
“…The mask-based MVDR [4]-[6], [20]-[23] and related mask-based GEV [24], [25] approaches predict the TF masks using DNNs before estimating the power spectral density (PSD) matrices for the target and overlapping speakers to obtain the beamforming filter parameters. Compared with the conventional stand-alone beamforming approaches, these neural-based methods allow a tighter integration with the downstream recognition back-end [5], [6], [19], [25], [26]. Large performance improvements have been reported for overlapped speech recognition tasks by using microphone array based multi-channel inputs [5], [6].…”
Section: Introduction
confidence: 99%
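The mask-based MVDR pipeline these excerpts describe has three steps: predict TF masks, accumulate PSD matrices from the masked observations, and derive the beamforming filter from those PSDs. A minimal numpy sketch of steps two and three, using the standard trace-normalized reference-channel MVDR formulation (a generic reconstruction under stated assumptions, not any cited paper's code; the mask here would come from a DNN):

```python
import numpy as np

def estimate_psd(obs, mask):
    """Mask-weighted PSD matrices, one per frequency bin.

    obs:  (F, T, C) complex multi-channel STFT
    mask: (F, T)    TF mask in [0, 1] (DNN output in the neural pipeline)
    Returns (F, C, C): Phi[f] = sum_t m[f,t] x[f,t] x[f,t]^H / sum_t m[f,t].
    """
    num = np.einsum('ft,ftc,ftd->fcd', mask, obs, obs.conj())
    den = mask.sum(axis=1)[:, None, None] + 1e-8
    return num / den

def mvdr_weights(psd_s, psd_n, ref=0):
    """MVDR filter w[f] = (Phi_n^-1 Phi_s / trace(Phi_n^-1 Phi_s)) u_ref."""
    F, C, _ = psd_s.shape
    u = np.zeros(C)
    u[ref] = 1.0                                   # reference microphone selector
    w = np.zeros((F, C), dtype=complex)
    for f in range(F):
        # Diagonal loading keeps the noise PSD invertible
        numerator = np.linalg.solve(psd_n[f] + 1e-6 * np.eye(C), psd_s[f])
        w[f] = (numerator / (np.trace(numerator) + 1e-8)) @ u
    return w

def apply_beamformer(w, obs):
    """Enhanced output y(f, t) = w(f)^H x(f, t)."""
    return np.einsum('fc,ftc->ft', w.conj(), obs)
```

For overlapped speech, one mask (and hence one PSD pair and one filter) is estimated per speaker, with the other speakers folded into the noise PSD; this is also where joint training with the recognition back-end attaches, since every step above is differentiable.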