2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2017.7953075

Joint CTC-attention based end-to-end speech recognition using multi-task learning

Abstract: Recently, there has been an increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. One approach is the attention-based encoder-decoder framework that learns a mapping between variable-length input and output sequences in one step using a purely data-driven method. The attention model has often been shown to improve the performance over another end-to-end approach, the Connectionist Temporal Classification (CTC), mainly because it explicit…
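The method the abstract outlines trains a shared encoder against an interpolated objective, L = λ·L_CTC + (1−λ)·L_attention. Below is a minimal PyTorch-style sketch of that multi-task loss, assuming a CTC branch over encoder frames and an attention decoder branch; module names, tensor shapes, and the λ = 0.2 default are illustrative assumptions, not the authors' code.

```python
import torch.nn as nn

class JointCTCAttentionLoss(nn.Module):
    """Multi-task objective L = lambda * L_ctc + (1 - lambda) * L_att.

    Sketch of the joint CTC-attention loss; names and defaults are
    illustrative, not the authors' implementation.
    """

    def __init__(self, blank_id: int = 0, ctc_weight: float = 0.2):
        super().__init__()
        self.ctc_weight = ctc_weight
        self.ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss(ignore_index=-1)

    def forward(self, ctc_log_probs, att_logits, targets,
                input_lengths, target_lengths, padded_targets):
        # CTC branch: log-probs of shape (T, B, V) over encoder frames.
        l_ctc = self.ctc(ctc_log_probs, targets,
                         input_lengths, target_lengths)
        # Attention branch: per-token cross-entropy over decoder steps;
        # att_logits is (B, U, V), padded_targets is (B, U), pad = -1.
        l_att = self.ce(att_logits.transpose(1, 2), padded_targets)
        return self.ctc_weight * l_ctc + (1.0 - self.ctc_weight) * l_att
```

Because both branches share the encoder, the CTC gradients regularize the attention model toward monotonic alignments while the attention branch retains its full output history.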

Cited by 761 publications (517 citation statements)
References 16 publications
“…End-to-end (E2E) models [2,3,4,5,6,7,8,9] have gained large popularity in the automatic speech recognition (ASR) community over the last few years. These models replace the components of a conventional ASR system, namely the acoustic (AM), pronunciation (PM) and language (LM) models, with a single neural network.…”
Section: Introduction
confidence: 99%
“…In this subsection, we briefly introduce the end-to-end single-channel multi-speaker speech recognition model proposed in [8,9], shown in Fig. 1. The model is an extension of the joint CTC/attention-based encoder-decoder framework [14] to recognize multi-speaker speech. The input O = {o1, .…”
Section: Single-Channel Multi-Speaker ASR
confidence: 99%
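Extending the joint CTC/attention framework to overlapped speech requires deciding which output stream corresponds to which reference transcript; the cited multi-speaker models resolve this with a permutation-free (permutation-invariant) training criterion. A hedged sketch of that idea follows; loss_fn, outputs, and references are placeholder names, not the cited papers' API.

```python
import itertools

def permutation_invariant_loss(loss_fn, outputs, references):
    """Minimum total loss over all assignments of the model's output
    streams to the reference transcripts (permutation-invariant, or
    "permutation-free", training). Sketch only; arguments are
    placeholders, not the cited papers' API.
    """
    best = None
    for perm in itertools.permutations(range(len(references))):
        # Score one assignment: output stream s paired with reference p.
        total = sum(loss_fn(outputs[s], references[p])
                    for s, p in enumerate(perm))
        if best is None or total < best:
            best = total
    return best
```

With S speakers this enumerates S! assignments, which is cheap for the two-speaker case these papers consider.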
“…For multi-channel multi-speaker speech recognition, an end-to-end system was proposed in [13], called MIMO-Speech because of its multi-channel input (MI) and multi-speaker output (MO). This system consists of a mask-based neural beamformer frontend, which explicitly separates the multi-speaker speech via beamforming, and an end-to-end speech recognition backend based on the joint CTC/attention-based encoder-decoder [14] to recognize the separated speech streams. This end-to-end architecture is optimized using only the connectionist temporal classification (CTC) and cross-entropy (CE) losses in the backend ASR, but is nonetheless able to develop relatively good separation abilities.…”
Section: Introduction
confidence: 99%
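A mask-based neural beamformer frontend of the kind mentioned above typically has a network predict time-frequency masks, from which spatial covariance matrices are estimated and beamforming filters (e.g. MVDR) derived in closed form. A sketch of the mask-weighted covariance statistic under those assumptions; NumPy, with illustrative variable names, not the MIMO-Speech implementation.

```python
import numpy as np

def masked_spatial_covariance(Y, mask, eps=1e-8):
    """Mask-weighted spatial covariance matrices, the statistic a
    mask-based beamformer (e.g. MVDR) is derived from. Sketch with
    illustrative names; not the MIMO-Speech code.

    Y:    (C, T, F) complex STFT of the C-channel mixture
    mask: (T, F) real-valued time-frequency mask from the frontend
    """
    C, T, F = Y.shape
    phi = np.zeros((F, C, C), dtype=np.complex128)
    for f in range(F):
        Yf = Y[:, :, f]                   # (C, T) channels x frames
        w = mask[:, f]                    # (T,) per-frame mask weights
        phi[f] = (w * Yf) @ Yf.conj().T / max(w.sum(), eps)
    return phi
```

One such covariance is typically estimated per speaker from that speaker's mask, plus one for the noise; the beamforming filter is then a function of those matrices.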
“…A CTC loss is readily applicable for training the MoChA model, especially the encoder, because it also constrains the alignment between input and output to be monotonic. Moreover, the CTC loss has the advantage of learning alignments in noisy environments and can help the attention-based model learn its alignment quickly through joint training [18].…”
Section: Joint CTC-CE Training
confidence: 99%
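Joint training as described here simply backpropagates an interpolated CTC + cross-entropy objective through the shared encoder at each step, so the monotonic CTC gradients guide the attention (or MoChA) alignment early in training. A hedged sketch of one such training step; model, the batch keys, and ctc_weight=0.3 are illustrative placeholders, not the cited paper's code.

```python
import torch.nn.functional as F

def train_step(model, batch, optimizer, ctc_weight=0.3):
    """One joint CTC + cross-entropy update (sketch; names assumed)."""
    # Assumed model outputs: CTC log-probs (T, B, V) over encoder
    # frames and decoder logits (B, U, V) over output tokens.
    ctc_log_probs, att_logits = model(batch["feats"], batch["feat_lens"],
                                      batch["tokens_in"])
    l_ctc = F.ctc_loss(ctc_log_probs, batch["tokens"],
                       batch["feat_lens"], batch["token_lens"],
                       zero_infinity=True)
    l_ce = F.cross_entropy(att_logits.transpose(1, 2),
                           batch["tokens_out"], ignore_index=-1)
    loss = ctc_weight * l_ctc + (1.0 - ctc_weight) * l_ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```

In practice the CTC input lengths would be the encoder's subsampled frame counts rather than the raw feature lengths; the sketch glosses over that bookkeeping.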