Interspeech 2021
DOI: 10.21437/interspeech.2021-1887
Dissecting User-Perceived Latency of On-Device E2E Speech Recognition

Cited by 17 publications (2 citation statements). References 0 publications.
“…The latency of ASR systems at runtime imposes another formidable bottleneck on voice-driven conversational interfaces, especially as long as they use endpointing methods, where response planning only starts when an utterance end is detected with some probability. User-perceived latency is the single biggest determinant of people's satisfaction with voice assistants (Shangguan et al., 2021; Bijwadia et al., 2023). Collecting realistic latency data would require implementing the tested systems in a voice UX environment with human users, which is beyond the scope of this paper (but see Aylett et al. (2023)).…”
Section: Limitations
confidence: 99%
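The endpointing bottleneck described in the excerpt above can be illustrated with a minimal sketch: a silence-threshold endpointer only declares the utterance finished after a run of trailing low-energy frames, so response planning cannot begin until that run completes. All function names, thresholds, and frame values here are illustrative assumptions, not details of the cited systems.

```python
# Hypothetical silence-threshold endpointer. The endpointing delay it
# introduces (time between actual speech end and the endpoint decision)
# is a floor on user-perceived response latency, since the assistant
# only starts planning a reply after the endpoint fires.

def detect_endpoint(frame_energies, silence_thresh=0.1, min_silence_frames=30):
    """Return the frame index at which the utterance is declared finished,
    i.e. after `min_silence_frames` consecutive low-energy frames."""
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        silent_run = silent_run + 1 if energy < silence_thresh else 0
        if silent_run >= min_silence_frames:
            return i
    return len(frame_energies)  # endpoint never fired; ran to end of audio

# Toy signal: 100 frames of speech, then 50 frames of silence (10 ms/frame).
energies = [0.8] * 100 + [0.02] * 50
endpoint_frame = detect_endpoint(energies)
speech_end_frame = 100
delay_ms = (endpoint_frame - speech_end_frame) * 10  # endpointing delay
```

With these illustrative settings the endpointer fires 29 frames (290 ms) after speech actually ends, which is latency the user experiences before any response planning even begins.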
“…Sequence transducer models, such as the recurrent neural network transducer (RNN-T) [2,3], the Transformer transducer [10,11], and the Conformer transducer (Conformer-T) [12,13], are among the most promising end-to-end models, especially in streaming scenarios, because of their inherently streaming nature. In streaming speech recognition, latency is one of the primary performance metrics along with recognition accuracy, because lower latency enables quicker responses from voice-enabled applications and improves the user experience [14]. However, streaming transducer models tend to delay label emission so as to see more future context and predict labels more accurately, which leads to large latency and a deteriorated user experience.…”
Section: Introduction
confidence: 99%
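The emission delay mentioned in this excerpt is commonly quantified by comparing when the decoder emits each word against the word's true end time from a forced alignment. A minimal sketch of that measurement, with invented timestamps purely for illustration:

```python
# Hypothetical emission-latency measurement for a streaming transducer:
# per-word delay = decoder emission time minus forced-alignment word end.
# Timestamps below are made up; real systems would take them from the
# decoder's emission log and an external aligner.

def emission_latency_ms(emit_times, align_end_times):
    """Return per-word emission delays (ms) and their mean."""
    delays = [e - a for e, a in zip(emit_times, align_end_times)]
    return delays, sum(delays) / len(delays)

align_ends = [300, 620, 1010]   # true word end times (ms) from alignment
emit_times = [520, 900, 1400]   # times (ms) the decoder emitted each word
delays, mean_delay = emission_latency_ms(emit_times, align_ends)
```

A streaming model that waits for more future context shifts the emission times later, inflating exactly this delay metric, which is the latency/accuracy trade-off the excerpt describes.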