Exploring Transformers for Large-Scale Speech Recognition
Preprint, 2020
DOI: 10.48550/arxiv.2005.09684

Cited by 11 publications (11 citation statements)
References 16 publications
“…Several studies suggest that down-sampling input representation using convolutional layers before processing with transformer layers provides better results for ASR [24,25]. Intuitively, convolutional layers use local context to produce better contextual features.…”
Section: ResNet+Transformer Model
Mentioning confidence: 99%
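The down-sampling strategy described in the statement above can be made concrete with a short sketch: two stride-2 convolutions reduce the frame rate of the filterbank features by roughly 4x before a stack of transformer encoder layers is applied. This is a minimal PyTorch illustration under assumed hyper-parameters (80-dim log-mel input, 256-dim model, 4 attention heads); it is not the exact front-end of the cited papers, and positional encoding is omitted for brevity.

```python
# Minimal sketch (assumed hyper-parameters): a strided Conv2d front-end that
# down-samples filterbank features ~4x in time before transformer layers.
import torch
import torch.nn as nn

class ConvSubsampledTransformerEncoder(nn.Module):
    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=6):
        super().__init__()
        # Two stride-2 convolutions: time and frequency are each reduced 4x.
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        freq_out = ((n_mels + 1) // 2 + 1) // 2      # frequency bins left after the two convs
        self.proj = nn.Linear(d_model * freq_out, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, feats):                        # feats: (batch, frames, n_mels)
        x = self.conv(feats.unsqueeze(1))            # (batch, d_model, frames/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.encoder(self.proj(x))            # (batch, frames/4, d_model)

enc = ConvSubsampledTransformerEncoder()
out = enc(torch.randn(2, 400, 80))                   # 400 frames -> 100 encoder states
print(out.shape)                                     # torch.Size([2, 100, 256])
```

Besides giving each transformer input a wider local context, the 4x shorter sequence also cuts the quadratic cost of self-attention, which is part of why this front-end is common in transformer ASR encoders.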
“…Transformers [21] are powerful neural architectures that have lately been used in ASR [22][23][24], SLU [25], and other audio-visual applications [26] with great success, mainly due to their attention mechanism. Only recently has the attention concept also been applied to beamforming, specifically for speech and noise mask estimation [9,27].…”
Section: Introduction
Mentioning confidence: 99%
“…End-to-end (E2E) automatic speech recognition (ASR) has made rapid progress in recent years [1,2,3,4,5,6,7]. Representative models include streaming models such as the recurrent neural network transducer (RNN-T) [1], attention-based models [8,2,3], and transformer-based models [9,10,11,12]. Compared to sophisticated conventional models [13,14], E2E models such as RNN-T and Listen, Attend and Spell (LAS) have shown competitive performance [6,5,7,15].…”
Section: Introduction
Mentioning confidence: 99%
“…While long short-term memory (LSTM) has been a popular building block for E2E models, there has been continuing success in applying transformer models [22] to ASR [23,11,10,9,24,25,4]. Instead of using a recurrent mechanism to model temporal dynamics, the transformer uses multi-headed attention to associate sequential elements in one step.…”
Section: Introduction
Mentioning confidence: 99%
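To illustrate the contrast drawn in the statement above, the toy sketch below runs the same utterance through a recurrent layer, which propagates a hidden state frame by frame, and through a single multi-headed self-attention call, which relates every pair of frames in one step. The dimensions are illustrative assumptions, not settings from the cited works.

```python
# Toy contrast (assumed dimensions): recurrence walks the sequence one step at a
# time, while multi-headed self-attention relates all frames to each other at once.
import torch
import torch.nn as nn

T, D = 100, 256                       # frames, feature dimension
x = torch.randn(1, T, D)              # one utterance's encoder input

# Recurrent modelling: the hidden state carries context frame by frame.
lstm = nn.LSTM(D, D, batch_first=True)
rnn_out, _ = lstm(x)                  # T sequential steps

# Multi-headed self-attention: every frame attends to every other frame in one call.
mha = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
attn_out, attn_weights = mha(x, x, x) # one step; weights relate all T x T frame pairs

print(rnn_out.shape, attn_out.shape, attn_weights.shape)
# torch.Size([1, 100, 256]) torch.Size([1, 100, 256]) torch.Size([1, 100, 100])
```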