Improved Robustness to Disfluencies in RNN-Transducer Based Speech Recognition

Mendelev, Valentin; Raissi, Tina; Camporese, Guglielmo; Giollo, Manuel

doi:10.1109/icassp39728.2021.9413618

Cited by 12 publications

(6 citation statements)

References 12 publications

“…A recent preprint by Mendelev et al [17] is most similar to our work. They built an end-to-end speech recognition model using typical speech and speech with dysfluencies and show a 16% relative improvement in WER on some voice command tasks for users who stutter compared to a baseline without stuttered speech.…”

Section: Introduction (supporting)

confidence: 90%

Analysis and Tuning of a Voice Assistant System for Dysfluent Speech

Mitra, Huang, Tooley et al. 2021

Interspeech 2021

Dysfluencies and variations in speech pronunciation can severely degrade speech recognition performance, and for many individuals with moderate-to-severe speech disorders, voice operated systems do not work. Current speech recognition systems are trained primarily with data from fluent speakers and as a consequence do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks. The focus of this work is on quantitative analysis of a consumer speech recognition system on individuals who stutter and production-oriented approaches for improving performance for common voice assistant tasks (i.e., "what is the weather?"). At baseline, this system introduces a significant number of insertion and substitution errors resulting in intended speech Word Error Rates (isWER) that are 13.64% worse (absolute) for individuals with fluency disorders. We show that by simply tuning the decoding parameters in an existing hybrid speech recognition system one can improve isWER by 24% (relative) for individuals with fluency disorders. Tuning these parameters translates to 3.6% better domain recognition and 1.7% better intent recognition relative to the default setup for the 18 study participants across all stuttering severities.

“…al. [9] studied the robustness of RNN-T based ASR models on disfluent speech that contained organic disfluencies like partial words using filters on utterance transcriptions that are indicative of hesitations and repetitions. We introduce the term organic disfluency to distinguish the speech containing hesitations and repetitions from people who do not self identify as People Who Stutter or have Stutter is observed in 5% to 10% of children's speech, who are aged between 2 and 6 years [11].…”

Section: Related Work (mentioning)

confidence: 99%

Enhancing ASR for Stuttered Speech with Limited Data Using Detect and Pass

Shonibare, Tong, Ravichandran 2022

Preprint

It is estimated that around 70 million people worldwide are affected by a speech disorder called stuttering [1]. With recent advances in Automatic Speech Recognition (ASR), voice assistants are increasingly useful in our everyday lives. Many technologies in education, retail, telecommunication and healthcare can now be operated through voice. Unfortunately, these benefits are not accessible for People Who Stutter (PWS). We propose a simple but effective method called 'Detect and Pass' to make modern ASR systems accessible for People Who Stutter in a limited data setting. The algorithm uses a context aware classifier trained on a limited amount of data, to detect acoustic frames that contain stutter. To improve robustness on stuttered speech, this extra information is passed on to the ASR model to be utilized during inference. Our experiments show a reduction of 12.18% to 71.24% in Word Error Rate (WER) across various state of the art ASR systems. Upon varying the threshold of the associated posterior probability of stutter for each stacked frame used in determining low frame rate (LFR) acoustic features, we were able to determine an optimal setting that reduced the WER by 23.93% to 71.67% across different ASR systems.

“…The research team initially collected data on speech dysfluencies and subsequently utilized this data to retrain their algorithms. By increasing the amount of training data with dysfluencies, they successfully improved the accuracy of their algorithms [30].…”

Section: Automatic Speech Recognition For Users With Diverse Needs (mentioning)

confidence: 99%

“…Only by better characterizing the causes and types of errors in the recognition algorithms can it be improved to meet the needs of all user groups [1]. Hence, to facilitate the characterization of speech errors, we analyzed the substitution, deletion, and insertion of speech recognition when users with Down Syndrome interact with speech algorithms, following the approach from related work targeting speech differences [29][30][31].…”

Section: Introduction (mentioning)

confidence: 99%

Limitations in Speech Recognition for Young Adults with Down Syndrome

Cibrian, Chen, Anderson et al. 2024

Preprint

Speech recognition has the potential to make technology more accessible to users. However, the accuracy of speech recognition remains limited for users with disabilities, including those with Down Syndrome, and the types and frequencies of recognition errors are poorly understood. This paper characterizes these problems, focusing on errors occurring when recognizing Down Syndrome speech. We analyze the transcripts from six speech recognition algorithms (Google, IBM, Otter.ai, Microsoft, AssemblyAI, OpenAI) using the audio content of 15 individuals with Down Syndrome (331 dialogues; 3,428 words). Our analysis shows (1) significant difference in speech recognition accuracy for people with Down Syndrome compared to neurotypical users; (2) the best algorithm for recognizing Down Syndrome speech is OpenAI (Word Accuracy = 67%; F1-score = 0.944), and (3) there is a prevalence of deletion errors followed by substitutions and insertions. These findings have implications for enhancing speech recognition for the next-generation voice assistants to meet the needs of users with Down Syndrome.
