Interspeech 2020
DOI: 10.21437/interspeech.2020-2638

Exploring Transformers for Large-Scale Speech Recognition

Abstract: While recurrent neural networks still largely define state-of-the-art speech recognition systems, the Transformer network has proven to be a competitive alternative, especially in the offline condition. Most studies with Transformers have been constrained to a relatively small-scale setting, and some form of data augmentation is usually applied to combat the data sparsity issue. In this paper, we aim at understanding the behaviors of Transformers in the large-scale speech recognition setting,…

Cited by 27 publications (10 citation statements) | References 22 publications

“…When absolute embedding is used, FRR = 1.1% at the same FAH. This contradicts the observations in [21,22], where same-layer dependency was found to be more advantageous for ASR, which was attributed to the fact that the receptive field is maximized at every layer. A better way of incorporating relative positional information for this case is left for future work.…”
Section: Streaming Transformers With Same-Layer Dependency (contrasting)
confidence: 69%
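To make the contrast in the statement above concrete, here is a minimal PyTorch sketch of the two positioning schemes being compared: an absolute sinusoidal embedding added to the input frames once, versus a relative position bias added to the attention logits at every layer. The function names, the clipping distance max_dist, and the learnable bias table are illustrative assumptions, not the formulations used in the cited papers.

```python
import math
import torch

def absolute_sinusoidal(seq_len: int, dim: int) -> torch.Tensor:
    """Absolute positional embedding: one fixed vector per time index,
    added to the acoustic frames before the first Transformer layer."""
    assert dim % 2 == 0, "illustrative sketch assumes an even model dimension"
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (T, 1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))                          # (dim/2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def relative_position_bias(seq_len: int, max_dist: int) -> torch.Tensor:
    """Relative alternative: a learnable bias indexed by the (query - key)
    offset and added to the attention scores of every layer, so positional
    information does not fade with depth."""
    bias_table = torch.nn.Parameter(torch.zeros(2 * max_dist + 1))
    offsets = torch.arange(seq_len).unsqueeze(1) - torch.arange(seq_len).unsqueeze(0)
    offsets = offsets.clamp(-max_dist, max_dist) + max_dist                # (T, T) indices
    return bias_table[offsets]                                             # (T, T) additive bias
```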
“…The time and space complexity are both reduced to O(T), and the within-chunk computation across time can be parallelized on GPUs. While recent work [18,19,20,21,22] with similar ideas has shown that such streaming Transformers achieve competitive performance compared with latency-controlled BiLSTMs [23] or non-streaming Transformers for ASR, it remains unclear how streaming Transformers behave on shorter sequence modeling tasks such as wake word detection.…”
Section: Introduction (mentioning)
confidence: 99%
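The O(T) claim in the statement above follows from restricting each query frame to its own chunk plus a bounded left context, so the number of keys per query stops growing with the sequence length. Below is a minimal sketch of such a chunk-wise attention mask, assuming PyTorch; the helper name chunk_attention_mask and the chunk/left-context sizes are hypothetical, not values from the cited work.

```python
import torch

def chunk_attention_mask(seq_len: int, chunk_size: int, left_chunks: int) -> torch.Tensor:
    """Boolean mask (True = blocked): each frame may attend only to its own
    chunk and a fixed number of left-context chunks.  With constant chunk_size
    and left_chunks, the keys per query are bounded, so attention cost grows
    as O(T) instead of O(T^2)."""
    chunk_id = torch.arange(seq_len) // chunk_size        # chunk index of each frame
    q_chunk = chunk_id.unsqueeze(1)                       # (T, 1) query chunks
    k_chunk = chunk_id.unsqueeze(0)                       # (1, T) key chunks
    allowed = (k_chunk <= q_chunk) & (k_chunk >= q_chunk - left_chunks)
    return ~allowed

# Frames inside a chunk still attend to each other, so their computation
# parallelizes across time on a GPU.
mask = chunk_attention_mask(seq_len=8, chunk_size=2, left_chunks=1)
attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)
x = torch.randn(1, 8, 16)
out, _ = attn(x, x, x, attn_mask=mask)  # streaming-style restricted self-attention
```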
“…First, Emformer removes the duplicated computation of the left context block by caching the keys and values from previous segments' self-attention. Second, rather than passing the memory bank within the current layer as in AM-TRF, Emformer, inspired by Transformer-XL [2] and its application to speech recognition [20], carries over the memory bank from the lower layer. Third, Emformer disables the summary vector's attention over the memory bank to avoid overweighting the left-most part of the context information.…”
Section: Introduction (mentioning)
confidence: 99%
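The first Emformer change described above, caching previous segments' keys and values so the left context is not re-projected for every new segment, can be illustrated with a single-head sketch. This is a hedged illustration of the caching idea only; the class name, cache length, and single-head projections are assumptions, not the Emformer implementation.

```python
import torch

class CachedSelfAttention(torch.nn.Module):
    """Single-head self-attention that appends each new segment's projected
    keys/values to a cache, so earlier segments serve as left context without
    being recomputed on later calls."""
    def __init__(self, dim: int, max_cache_frames: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(dim, dim)
        self.k_proj = torch.nn.Linear(dim, dim)
        self.v_proj = torch.nn.Linear(dim, dim)
        self.max_cache = max_cache_frames
        self.k_cache = None  # (batch, cached_frames, dim)
        self.v_cache = None

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # Only the newly arrived frames are projected.
        q = self.q_proj(segment)
        k_new, v_new = self.k_proj(segment), self.v_proj(segment)
        if self.k_cache is None:
            k, v = k_new, v_new
        else:
            k = torch.cat([self.k_cache, k_new], dim=1)
            v = torch.cat([self.v_cache, v_new], dim=1)
        scores = q @ k.transpose(1, 2) / k.size(-1) ** 0.5
        out = torch.softmax(scores, dim=-1) @ v
        # Carry over only the most recent projected keys/values as left context.
        self.k_cache = k[:, -self.max_cache:].detach()
        self.v_cache = v[:, -self.max_cache:].detach()
        return out
```

Each call processes one segment: attention covers the cached left context plus the current frames, while only the current frames pass through the projections again.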
“…Transformers [21] are powerful neural architectures that have lately been used with great success in ASR [22][23][24], SLU [25], and other audio-visual applications [26], mainly owing to their attention mechanism. Only recently has the attention concept also been applied to beamforming, specifically for speech and noise mask estimation [9,27].…”
Section: Introduction (mentioning)
confidence: 99%
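As a rough illustration of that last point, attention-based mask estimation can be sketched as self-attention over noisy spectral frames followed by a sigmoid head that emits per-time-frequency speech and noise masks for a downstream beamformer. The layer sizes, names, and two-layer encoder below are assumptions for illustration, not the models of [9,27].

```python
import torch

class AttentionMaskEstimator(torch.nn.Module):
    """Illustrative sketch: a small Transformer encoder over noisy
    log-spectral frames, with a sigmoid head producing speech and noise
    masks that a beamformer could use for spatial covariance estimation."""
    def __init__(self, num_bins: int = 257, dim: int = 256, heads: int = 4):
        super().__init__()
        self.embed = torch.nn.Linear(num_bins, dim)
        self.encoder = torch.nn.TransformerEncoder(
            torch.nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2,
        )
        self.head = torch.nn.Linear(dim, 2 * num_bins)  # speech mask + noise mask

    def forward(self, noisy_spec: torch.Tensor):
        # noisy_spec: (batch, time, num_bins) log-magnitude features
        h = self.encoder(self.embed(noisy_spec))
        speech_mask, noise_mask = torch.sigmoid(self.head(h)).chunk(2, dim=-1)
        return speech_mask, noise_mask
```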