ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053896
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

Abstract: In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from both audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurre…
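The joint step the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's code: for each pair of acoustic frame position t and label history u, the audio- and label-encoder activations are combined by a feed-forward layer into a distribution over the label space. All shapes, the additive combination, and the parameter names are illustrative assumptions.

```python
# Sketch of a transducer joint network (illustrative, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

T, U, D, V = 5, 3, 8, 10          # frames, label positions, encoder dim, vocab (incl. blank)
f = rng.standard_normal((T, D))   # audio-encoder activations, one row per frame
g = rng.standard_normal((U, D))   # label-encoder activations, one row per label state
W = rng.standard_normal((D, V)) * 0.1
b = np.zeros(V)

def joint(f_t, g_u):
    """Feed-forward joint: combine one audio frame and one label state
    into a probability distribution over the label space."""
    h = np.tanh(f_t + g_u)        # simple additive combination (an assumption)
    logits = h @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()            # softmax over labels

# One distribution for every (t, u) combination, as in the abstract.
P = np.stack([[joint(f[t], g[u]) for u in range(U)] for t in range(T)])
print(P.shape)                    # (T, U, V)
```

An RNN-T-style loss would then marginalize over all alignment paths through this (T, U, V) lattice; that part is omitted here.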

Cited by 364 publications
(241 citation statements)
References 19 publications
“…Therefore, several models based on attention mechanisms have been proposed to make it possible for Transducer models to exploit contextual information. The Transformer-Transducer (T-T) [15,16] has been proposed for speech recognition, with the Transformer [17] having become the state-of-the-art approach in language modeling and machine translation [18][19][20]. These models replace the LSTM with the Transformer encoder, which mainly consists of multi-head attention mechanisms, feed-forward networks, and layer normalization.…”
Section: Introduction
confidence: 99%
“…Experiments based on the T-T show that the accuracy of a streaming model that considers contextual information is comparable to that of offline models. Both the truncated self-attention adopted in [15] and the masked self-attention adopted in [16] reduce the error rate of the streaming model.…”
Section: Introduction
confidence: 99%
“…Other research has focused on local monotonic attention [24][25]. Google proposed Transformer encoders with RNN-T loss [26], showing that limiting the left and right attention context per layer can achieve reasonable accuracy, though a gap remains relative to full-attention models.…”
Section: Introduction
confidence: 99%
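The per-layer context limiting mentioned in the excerpt above can be sketched as an additive mask on self-attention scores: each frame may attend only to a fixed number of left and right neighboring frames. The window sizes, single-head form, and shapes below are illustrative assumptions, not values from the cited papers.

```python
# Sketch of context-limited (masked) self-attention (illustrative only).
import numpy as np

def context_mask(T, left, right):
    """Boolean (T, T) mask: True where frame i may attend to frame j,
    i.e. -left <= j - i <= right."""
    idx = np.arange(T)
    rel = idx[None, :] - idx[:, None]     # rel[i, j] = j - i
    return (rel >= -left) & (rel <= right)

def masked_self_attention(x, left=2, right=1):
    """Single-head self-attention over x (T, D) with a limited context window."""
    d = x.shape[-1]
    scores = (x @ x.T) / np.sqrt(d)
    # Disallowed positions get a large negative score before the softmax.
    scores = np.where(context_mask(len(x), left, right), scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

x = np.random.default_rng(1).standard_normal((6, 4))
y = masked_self_attention(x)
print(y.shape)                            # (6, 4)
```

Setting `right=0` removes all future context, which is what makes such a layer usable in a streaming recognizer; stacking layers grows the effective left/right receptive field by the per-layer window at each depth.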
“…Most conventional ASR systems [2,3] consist of separately trained modules, such as the acoustic model, language model, and pronunciation dictionary. In recent years, end-to-end ASR systems [4][5][6][7], which can be directly trained to maximize the probability of a word sequence given an acoustic feature sequence, have become the focus of research. Many researchers [7,8] have reported that end-to-end ASR systems can significantly simplify the speech recognition pipeline and outperform conventional ASR systems on several representative speech datasets.…”
Section: Introduction
confidence: 99%