“…End-to-end (E2E) automatic speech recognition (ASR) has made rapid progress in recent years [1,2,3,4,5,6,7]. Representative architectures include streaming models such as the recurrent neural network transducer (RNN-T) [1], attention-based models [8,2,3], and transformer-based models [9,10,11,12]. Compared with sophisticated conventional models [13,14], E2E models such as RNN-T and Listen, Attend and Spell (LAS) have shown competitive performance [6,5,7,15].…”