Interspeech 2019
DOI: 10.21437/interspeech.2019-1938

Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration

Abstract: The state-of-the-art neural network architecture named Transformer has been used successfully for many sequence-to-sequence transformation tasks. The advantage of this architecture is that it has a fast iteration speed in the training stage because there is no sequential operation as with recurrent neural networks (RNN). However, an RNN is still the best option for end-to-end automatic speech recognition (ASR) tasks in terms of overall training speed (i.e., convergence) and word error rate (WER) because of effe…

Cited by 178 publications (160 citation statements)
References 15 publications (43 reference statements)
“…An RNN-based language model (LM) is employed via shallow fusion. The RNN-LM consists of 4 LSTM layers with 2048 units each [13], CTC prefix beam search decoding only [20], and attention beam search decoding only [3]. In addition, results for including the RNN-LM, for using data augmentation [25] as well as for the large transformer setup are shown.…”
Section: Dataset (mentioning, confidence: 99%)
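The shallow fusion referred to in this statement interpolates the end-to-end ASR model's score with an external RNN-LM score at each beam-search expansion. Below is a minimal sketch of that interpolation, assuming hypothetical score dictionaries and an illustrative fusion weight; it is not the cited systems' implementation.

```python
import math

def shallow_fusion_step(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine ASR and external LM scores for one beam-search expansion.

    asr_log_probs / lm_log_probs: dict mapping candidate token -> log probability.
    lm_weight: interpolation weight for the language model (illustrative value).
    """
    fused = {}
    for token, asr_lp in asr_log_probs.items():
        lm_lp = lm_log_probs.get(token, -math.inf)
        # Shallow fusion: log P_asr(y|x) + lambda * log P_lm(y)
        fused[token] = asr_lp + lm_weight * lm_lp
    return fused

# Usage: keep the best-scoring extensions for the beam.
asr_scores = {"a": -0.5, "b": -1.2, "<eos>": -2.0}
lm_scores = {"a": -0.9, "b": -0.4, "<eos>": -1.5}
best = sorted(shallow_fusion_step(asr_scores, lm_scores).items(),
              key=lambda kv: kv[1], reverse=True)[:2]
print(best)
```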
“…In addition, results for including the RNN-LM, for using data augmentation [25] as well as for the large transformer setup are shown. Table 1 presents ASR results of our transformer-based baseline systems, which are jointly trained with CTC to optimize training convergence and ASR accuracy [3,13]. Results of different decoding methods are shown with and without using the RNN-LM, SpecAugment [25], and the large transformer model.…”
Section: Dataset (mentioning, confidence: 99%)
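The joint training with CTC mentioned in this statement follows the hybrid CTC/attention recipe [3,13], in which a CTC loss on the encoder output is interpolated with the attention decoder's cross-entropy loss. The PyTorch sketch below illustrates that weighted objective; the tensor shapes, module layout, and the 0.3 weight are assumptions for illustration, not the cited systems' exact configuration.

```python
import torch
import torch.nn as nn

# Hybrid CTC/attention objective: L = lambda * L_ctc + (1 - lambda) * L_att
ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)
att_loss_fn = nn.CrossEntropyLoss(ignore_index=-1)

def joint_loss(ctc_log_probs, input_lengths, targets, target_lengths,
               decoder_logits, decoder_targets, ctc_weight=0.3):
    """ctc_log_probs: (T, N, vocab) log-softmax outputs of the encoder's CTC head.
    decoder_logits: (N * L, vocab) attention-decoder outputs, flattened over time.
    ctc_weight: interpolation weight (0.3 is an illustrative choice)."""
    l_ctc = ctc_loss_fn(ctc_log_probs, targets, input_lengths, target_lengths)
    l_att = att_loss_fn(decoder_logits, decoder_targets)
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_att

# Toy usage with random tensors just to exercise the shapes.
T, N, V, L = 50, 2, 30, 10
logp = torch.randn(T, N, V).log_softmax(-1)
tgt = torch.randint(1, V, (N, L))
loss = joint_loss(logp, torch.full((N,), T), tgt, torch.full((N,), L),
                  torch.randn(N * L, V), tgt.reshape(-1))
print(loss.item())
```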
“…Recently, Transformer [12] has gained success in ASR field [13,14,15]. Transformer-based models are parallelizable and competitive to recurrent neural networks [16].…”
Section: Introduction (mentioning, confidence: 99%)
“…Recently, Transformer models [15] have shown impressive performance in many tasks, such as pretrained language models [16,17], end-to-end speech recognition [18,19], and speaker diarization [20], surpassing the long short-term memory recurrent neural networks (LSTM-RNNs) based models. One of the key components in the Transformer model is self-attention, which computes the contribution information of the whole input sequence and maps the sequence into a vector at every time step.…”
Section: Introduction (mentioning, confidence: 99%)
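The self-attention operation described in the statement above scores every position of the input sequence against every other position and produces one context vector per time step. A minimal scaled dot-product sketch is given below; the single-head NumPy formulation is an illustrative assumption rather than the cited model's multi-head layer.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (T, d_model) input sequence; w_q / w_k / w_v: (d_model, d_k) projections.
    Returns a (T, d_k) sequence in which each time step is a weighted sum over
    the whole input, weighted by query-key similarity."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (T, T) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ v

T, d_model, d_k = 5, 16, 8
rng = np.random.default_rng(0)
out = self_attention(rng.normal(size=(T, d_model)),
                     *(rng.normal(size=(d_model, d_k)) for _ in range(3)))
print(out.shape)  # (5, 8)
```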