End-to-End Speech Recognition: A Survey

Prabhavalkar, Rohit; Hori, Takaaki; Sainath, Tara N.; Schlüter, Ralf; Watanabe, Shinji

doi:10.48550/arxiv.2303.03329

Cited by 5 publications

(8 citation statements)

References 289 publications

“…Here we show, how a seemingly very simple modification makes the AED model streamable and turns out to be very robust and competitive, specifically on long-form speech, in contrast to many other AED and transducer models [8][9][10][11][12]. Interestingly, the small modification leads to an equivalence to transducer models, and we study the exact modeling differences.…”

mentioning

confidence: 85%

“…This is different to the standard transducer training, which performs a full sum over all alignment paths. The standard transducer training criterion cannot be applied easily here due to the alignment label dependencies [9,12].…”

Section: Training

mentioning

confidence: 99%

“…INTRODUCTION & RELATED WORK Among the potential streaming models, there are the traditional HMM [1], CTC [2] and more recently transducer [3]. While many streamable attention-based encoder-decoder (AED) models were proposed [4][5][6][7][8], they are too complicated, relying on too much heuristics and not being robust enough in comparison to the transducer [9].…”

mentioning

confidence: 99%

“…Similar chunking in the decoder has been done in [14][15][16][17][18][19] and similar chunking in the encoder has been done in [7,[20][21][22][23][24][25][26][27][28][29]. There are also other approaches to make self-attention in the encoder streamable [9,30,31].…”

mentioning

confidence: 99%

Investigating Methods to Improve Language Model Integration for Attention-Based Encoder-Decoder ASR Models

Zeineldeen, Glushko, Michel et al. 2021

Interspeech 2021

We study a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks. A special end-of-chunk (EOC) symbol advances from one chunk to the next chunk, effectively replacing the conventional end-of-sequence symbol. This modification, while minor, situates our model as equivalent to a transducer model that operates on chunks instead of frames, where EOC corresponds to the blank symbol. We further explore the remaining differences between a standard transducer and our model. Additionally, we examine relevant aspects such as long-form speech generalization, beam size, and length normalization. Through experiments on Librispeech and TED-LIUM-v2, and by concatenating consecutive sequences for long-form trials, we find that our streamable model maintains competitive performance compared to the non-streamable variant and generalizes very well to long-form speech.
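The chunk-and-EOC mechanism described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `EOC`, `split_into_chunks`, `greedy_chunked_decode`, and `toy_next_label` are hypothetical, the "model" is a toy stand-in that simply copies non-silence frames, and greedy search is assumed in place of beam search. The sketch only shows the control flow: labels are emitted for the current chunk until EOC is predicted, EOC advances to the next chunk, and decoding ends after the last chunk without any end-of-sequence symbol.

```python
from typing import Callable, List, Sequence

EOC = "<eoc>"  # hypothetical name for the end-of-chunk symbol


def split_into_chunks(frames: Sequence, chunk_size: int) -> List[list]:
    """Cut the frame sequence into fixed-size windows (the last may be shorter)."""
    return [list(frames[i:i + chunk_size]) for i in range(0, len(frames), chunk_size)]


def greedy_chunked_decode(
    frames: Sequence,
    chunk_size: int,
    next_label: Callable[[list, list, list], str],
    max_labels_per_chunk: int = 10,
) -> list:
    """Emit labels chunk by chunk; EOC moves on to the next chunk.

    There is no end-of-sequence symbol: decoding stops once the
    last chunk has produced EOC (or hit the per-chunk label limit).
    """
    output: list = []
    for chunk in split_into_chunks(frames, chunk_size):
        emitted_in_chunk: list = []
        for _ in range(max_labels_per_chunk):
            label = next_label(chunk, emitted_in_chunk, output)
            if label == EOC:
                break  # advance to the next chunk, like blank in a transducer
            emitted_in_chunk.append(label)
            output.append(label)
    return output


def toy_next_label(chunk: list, emitted_in_chunk: list, history: list) -> str:
    """Toy stand-in for a trained decoder (an assumption for illustration).

    It copies the non-silence frames of the chunk in order, then predicts EOC.
    """
    pending = [frame for frame in chunk if frame != "_"]
    if len(emitted_in_chunk) < len(pending):
        return pending[len(emitted_in_chunk)]
    return EOC


# Frames are single characters here; "_" stands for silence.
frames = list("h_e_l_lo__")
print("".join(greedy_chunked_decode(frames, 4, toy_next_label)))  # prints "hello"
```

In a real system `next_label` would be the decoder's argmax over the label vocabulary plus EOC, conditioned on the encoder output for the current chunk and the label history.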

“…Modern end-to-end automatic speech recognition (E2E-ASR) systems have made remarkable strides, performing well across various types of data (Gulati et al, 2020; Prabhavalkar et al, 2023). This success can be attributed to the advancement of deep learning techniques relying on different training strategies, highly dependent on large datasets.…”

Section: Introduction

mentioning

confidence: 99%

Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

2023

Although Spanish is one of the most spoken languages in the world, it was only until very recently that the development of linguistic technologies for it had a strong boost. However, this is not entirely true for some of its Latin American variants, such as the Mexican Spanish, which show phonetic, and also some lexical and semantic differences with respect to peninsular Spanish. This talk will focus on presenting the development of NLP for Mexican Spanish, emphasizing the path taken through the organization of different evaluation campaigns. It will present some data about Mexican Spanish as well as about the impact of the organization of shared tasks in the context of IberLEF for the development of the NLP area in our country, first as a mechanism to motivate more students to get involved, and then as a vehicle to build resources and design and implement specific methods. The talk will conclude by exposing some of the obstacles faced, our main achievements, and some plans for the coming years.

Bio: Manuel Montes is Full Professor at the National Institute of Astrophysics, Optics and Electronics (INAOE) of Mexico. His research is on automatic text processing. He is author of more than 250 journal and conference papers in the fields of information retrieval, text mining and authorship analysis. He has been visiting professor at the Polytechnic University of Valencia (Spain), and the University of Alabama (USA). He is also member of the Mexican Academy of Sciences (AMC), and founding member of the Mexican Academy of Computer Science (AMEXCOMP), and the Mexican Association of Natural Language Processing (AMNLP). In the context of the latter, he has been the organizer of
