Improved Robustness to Disfluencies in RNN-Transducer Based Speech Recognition

Mendelev, Valentin; Raissi, Tina; Camporese, Guglielmo; Giollo, Manuel

doi:10.48550/arxiv.2012.06259

Cited by 1 publication

(4 citation statements)

References 10 publications

“…2.1.3 Dysfluent Speech Recognition. Technical work on improving speech assistants for PWS has focused on ASR models [8,23,31,35,50,51,61], stuttering detection [43], dysfluency detection or classification [22,40,42,48,56], clinical assessment [11], and dataset development [12,37,42,55]. Shonibare et al [61] and Mendelev et al [50] investigate training end-to-end RNN-T ASR models on speech from PWS.…”

Section: Overview Of Speech Recognition Systems (mentioning)

confidence: 99%

“…Technical work on improving speech assistants for PWS has focused on ASR models [8,23,31,35,50,51,61], stuttering detection [43], dysfluency detection or classification [22,40,42,48,56], clinical assessment [11], and dataset development [12,37,42,55]. Shonibare et al [61] and Mendelev et al [50] investigate training end-to-end RNN-T ASR models on speech from PWS. Shonibare et al introduces a detect-then-pass approach that incorporates a dysfluency detector where audio frames with dysfluencies are ignored entirely by the RNN-T decoder.…”

Section: Overview Of Speech Recognition Systems (mentioning)

confidence: 99%

“…Our approach has been to identify technical improvements that integrate with rather than replace existing systems, by offering relatively small adaptations to the speech recognition pipeline: changing an endpointer threshold, tuning the ASR's decoder model, and inserting a post-processing step to refine dysfluencies. Moreover, the techniques require very little data compared to what would be needed to train endpointer and ASR models from scratch, and are thus in contrast to many recent papers that replace existing systems with large ASR models trained on dysfluent speech (e.g., [2,50,61]) or personalized ASR models that require large amounts of data from a specific person (e.g., [33] which requires 15-120 minutes of speech).…”

Section: Technical Improvements For Dysfluent Speech (mentioning)

confidence: 99%

“…Research on speech technology for PWS has largely focused on technical improvements to automatic speech recognition (ASR) models [31,35,50,51,61], dysfluency detection [22,40,42,48], and dataset development [12,37,42,55]. This body of work has largely lacked a human-centered approach to understanding the experiences that PWS have with speech recognition systems [17], which could in turn inform how to prioritize and advance technical improvements.…”

Section: Introduction (mentioning)

confidence: 99%

From User Perceptions to Technical Improvement: Enabling People Who Stutter to Better Use Speech Recognition

Lea, Huang, Tooley et al. 2023

Preprint

Consumer speech recognition systems do not work as well for many people with speech differences, such as stuttering, relative to the rest of the general population. However, what is not clear is the degree to which these systems do not work, how they can be improved, or how much people want to use them. In this paper, we first address these questions using results from a 61-person survey from people who stutter and find participants want to use speech recognition but are frequently cut off, misunderstood, or speech predictions do not represent intent. In a second study, where 91 people who stutter recorded voice assistant commands and dictation, we quantify how dysfluencies impede performance in a consumer-grade speech recognition system. Through three technical investigations, we demonstrate how many common errors can be prevented, resulting in a system that cuts utterances off 79.1% less often and improves word error rate from 25.4% to 9.9%.

CCS CONCEPTS • Human-centered computing → Empirical studies in accessibility.
