Interspeech 2021
DOI: 10.21437/interspeech.2021-1887
Dissecting User-Perceived Latency of On-Device E2E Speech Recognition

Cited by 17 publications (2 citation statements). References 0 publications.
“…The latency of ASR systems at runtime imposes another formidable bottleneck on voice-driven conversational interfaces, especially as long as they use endpointing methods, where response planning only starts when an utterance end is detected with some probability. User-perceived latency is the single biggest determinant of people's satisfaction with voice assistants (Shangguan et al., 2021; Bijwadia et al., 2023). Collecting realistic latency data would require implementing the tested systems in a voice UX environment with human users, which is beyond the scope of this paper (but see Aylett et al. (2023)).…”
Section: Limitations
confidence: 99%
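The endpointing bottleneck described in the excerpt above can be illustrated with a minimal sketch: a silence-threshold endpointer only declares the utterance finished after a run of trailing low-energy frames, so response planning cannot begin until that run completes. All function names, thresholds, and frame values here are illustrative assumptions, not details of the cited systems.

```python
# Hypothetical silence-threshold endpointer. The endpointing delay it
# introduces (time between actual speech end and the endpoint decision)
# is a floor on user-perceived response latency, since the assistant
# only starts planning a reply after the endpoint fires.

def detect_endpoint(frame_energies, silence_thresh=0.1, min_silence_frames=30):
    """Return the frame index at which the utterance is declared finished,
    i.e. after `min_silence_frames` consecutive low-energy frames."""
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        silent_run = silent_run + 1 if energy < silence_thresh else 0
        if silent_run >= min_silence_frames:
            return i
    return len(frame_energies)  # endpoint never fired; ran to end of audio

# Toy signal: 100 frames of speech, then 50 frames of silence (10 ms/frame).
energies = [0.8] * 100 + [0.02] * 50
endpoint_frame = detect_endpoint(energies)
speech_end_frame = 100
delay_ms = (endpoint_frame - speech_end_frame) * 10  # endpointing delay
```

With these illustrative settings the endpointer fires 29 frames (290 ms) after speech actually ends, which is latency the user experiences before any response planning even begins.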
“…Sequence transducer models, such as the recurrent neural network transducer (RNN-T) [2,3], the Transformer transducer [10,11], and the Conformer transducer (Conformer-T) [12,13], are among the most promising end-to-end models, especially in streaming scenarios, because of their inherently streaming nature. In streaming speech recognition, latency is one of the primary performance metrics along with recognition accuracy, because lower latency enables quicker responses from voice-enabled applications and improves the user experience [14]. However, streaming transducer models tend to delay label emission so as to see more future context and predict labels more accurately, which leads to large latency and a deteriorated user experience.…”
Section: Introduction
confidence: 99%
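The emission delay mentioned in this excerpt is commonly quantified by comparing when the decoder emits each word against the word's true end time from a forced alignment. A minimal sketch of that measurement, with invented timestamps purely for illustration:

```python
# Hypothetical emission-latency measurement for a streaming transducer:
# per-word delay = decoder emission time minus forced-alignment word end.
# Timestamps below are made up; real systems would take them from the
# decoder's emission log and an external aligner.

def emission_latency_ms(emit_times, align_end_times):
    """Return per-word emission delays (ms) and their mean."""
    delays = [e - a for e, a in zip(emit_times, align_end_times)]
    return delays, sum(delays) / len(delays)

align_ends = [300, 620, 1010]   # true word end times (ms) from alignment
emit_times = [520, 900, 1400]   # times (ms) the decoder emitted each word
delays, mean_delay = emission_latency_ms(emit_times, align_ends)
```

A streaming model that waits for more future context shifts the emission times later, inflating exactly this delay metric, which is the latency/accuracy trade-off the excerpt describes.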