2023
DOI: 10.48550/arxiv.2303.03329
Preprint

End-to-End Speech Recognition: A Survey

Abstract: In the last decade of automatic speech recognition (ASR) research, the introduction of deep learning brought considerable reductions in word error rate of more than 50% relative, compared to modeling without deep learning. In the wake of this transition, a number of all-neural ASR architectures were introduced. These so-called end-to-end (E2E) models provide highly integrated, completely neural ASR models, which rely strongly on general machine learning knowledge, learn more consistently from data, while depen…

Cited by 5 publications (8 citation statements)
References 289 publications
“…Here we show, how a seemingly very simple modification makes the AED model streamable and turns out to be very robust and competitive, specifically on long-form speech, in contrast to many other AED and transducer models [8][9][10][11][12]. Interestingly, the small modification leads to an equivalence to transducer models, and we study the exact modeling differences.…”
mentioning
confidence: 85%
“…This is different to the standard transducer training, which performs a full sum over all alignment paths. The standard transducer training criterion cannot be applied easily here due to the alignment label dependencies [9,12].…”
Section: Training
mentioning
confidence: 99%
“…Modern end-to-end automatic speech recognition (E2E-ASR) systems have made remarkable strides, performing well across various types of data (Gulati et al., 2020; Prabhavalkar et al., 2023). This success can be attributed to the advancement of deep learning techniques relying on different training strategies, highly dependent on large datasets.…”
Section: Introduction
mentioning
confidence: 99%