Speech is critical for interpersonal communication, but not everyone speaks fluently. Speech disfluency, including stuttering and interruptions, affects not only emotional expression but also clarity of expression for people who stutter. Existing methods for detecting speech disfluency rely heavily on annotated data, which is costly to obtain. Moreover, these methods do not account for variable-length disfluent speech, which limits their scalability. To address these limitations, this paper proposes an automated speech disfluency detection method that can help individuals improve their communication skills and assist therapists in tracking the progress of stuttering patients. The method detects four types of disfluency using single-task detection, extracting features from embeddings of the pre-trained wav2vec2.0 model with convolutional neural network (CNN) and Transformer models. To handle variable-length disfluent speech, the model is modified based on the entropy invariance of attention mechanisms, which improves its scalability; this scalability across languages and utterance lengths also enhances the method's practical applicability. Experiments show that the model outperforms baseline models on both English and Chinese datasets, demonstrating its universality and scalability in real-world applications.
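The abstract does not give the exact attention modification, but one common entropy-invariant formulation rescales the attention logits by log n / log n0 (where n is the current sequence length and n0 the training length), so the entropy of the attention distribution stays roughly constant as inputs grow or shrink. A minimal PyTorch sketch under that assumption (the function name and `train_len` default are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def entropy_invariant_attention(q, k, v, train_len=512):
    """Scaled dot-product attention whose logit scale grows with log(n).

    Replacing the usual 1/sqrt(d) factor with (log n / log n0) / sqrt(d)
    keeps attention entropy roughly constant as the sequence length n
    departs from the training length n0, one way to make a Transformer
    handle variable-length inputs.
    q, k, v: (batch, heads, seq_len, head_dim)
    """
    n, d = q.shape[-2], q.shape[-1]
    scale = (math.log(n) / math.log(train_len)) / math.sqrt(d)
    logits = torch.matmul(q, k.transpose(-2, -1)) * scale
    weights = F.softmax(logits, dim=-1)
    return torch.matmul(weights, v)

# toy usage: batch=2, heads=4, a 300-frame utterance, head_dim=64
q = torch.randn(2, 4, 300, 64)
out = entropy_invariant_attention(q, q, q, train_len=512)
print(out.shape)  # torch.Size([2, 4, 300, 64])
```

At n = n0 the scale reduces to the standard 1/sqrt(d), so the modification only takes effect when utterance length deviates from the training length.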
Automatic fluency assessment of spontaneous speech without a reference text is a challenging task that depends heavily on the accuracy of automatic speech recognition (ASR). In this scenario, it is worth exploring an assessment method that incorporates ASR: besides the acoustic features essential for assessment, the text output by ASR may also carry fluency-related information. However, most existing studies on automatic fluency assessment of spontaneous speech rely solely on audio features, without utilizing textual information, which may lead to a limited view of fluency. To address this, we propose a multimodal automatic speech fluency assessment method that incorporates ASR output. Specifically, we first explore the relevance of the fluency assessment task to the ASR task and fine-tune the Wav2Vec2.0 model with multi-task learning, jointly optimizing the ASR and fluency assessment tasks to produce both the assessment results and the ASR output. Then, the text features and audio features obtained from the fine-tuned model are fed into a multimodal fluency assessment model that uses attention mechanisms to obtain more reliable assessment results. Finally, experiments on the PSCPSF and Speechocean762 datasets suggest that the proposed method performs well in different assessment scenarios.
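The multi-task setup can be pictured as a shared Wav2Vec2.0 encoder with two heads, one for CTC-based ASR and one for fluency scoring, trained on a weighted joint loss. The sketch below is ours, not the paper's: the checkpoint name, head sizes, mean-pooling, the 0 = blank/pad convention, and the CTC-plus-cross-entropy weighting are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import Wav2Vec2Model

class MultiTaskWav2Vec2(nn.Module):
    """Joint ASR (CTC) + fluency-scoring heads on a shared Wav2Vec2 encoder."""

    def __init__(self, vocab_size=32, num_fluency_classes=3, alpha=0.5):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        hidden = self.encoder.config.hidden_size
        self.ctc_head = nn.Linear(hidden, vocab_size)               # ASR branch
        self.fluency_head = nn.Linear(hidden, num_fluency_classes)  # assessment branch
        self.alpha = alpha                                          # loss weight (assumed)

    def forward(self, waveform, transcripts=None, fluency_labels=None):
        hidden = self.encoder(waveform).last_hidden_state           # (B, T, H)
        ctc_logits = self.ctc_head(hidden)                          # (B, T, V)
        fluency_logits = self.fluency_head(hidden.mean(dim=1))      # mean-pool over time

        loss = None
        if transcripts is not None and fluency_labels is not None:
            log_probs = F.log_softmax(ctc_logits, dim=-1).transpose(0, 1)  # (T, B, V)
            input_lens = torch.full((waveform.size(0),), log_probs.size(0),
                                    dtype=torch.long)
            target_lens = (transcripts != 0).sum(dim=-1)  # assumes 0 = CTC blank/pad
            ctc = F.ctc_loss(log_probs, transcripts, input_lens, target_lens)
            ce = F.cross_entropy(fluency_logits, fluency_labels)
            loss = ctc + self.alpha * ce                  # joint multi-task objective
        return ctc_logits, fluency_logits, loss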
The Putonghua Proficiency Test (Putonghua Shuiping Ceshi, PSC) is a standardized speaking test in China. Its propositional speaking section assesses a speaker's ability to express ideas fluently and accurately without a textual reference. Unlike the other sections of the PSC, however, propositional speaking is still scored manually, which is inefficient, costly, and subjective. To address these issues, a multimodal automatic speech fluency evaluation method is proposed. First, different neural networks extract unimodal features. Then, cross-modal attention fuses the modalities. Finally, self-attention reinforces high-contributing information to produce the fluency evaluation result. The proposed method achieves 81.67% accuracy on a self-built dataset, demonstrating that combining textual and acoustic features provides complementary information that improves automatic speech fluency evaluation.
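The fusion pipeline the abstract outlines (cross-modal attention, then self-attention over the fused features) can be sketched with standard PyTorch attention modules. The dimensions, the concatenate-then-self-attend design, and the mean-pooled classifier are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Cross-modal attention (each modality queries the other) followed by
    self-attention that re-weights the fused sequence toward
    high-contributing positions."""

    def __init__(self, dim=256, heads=4, num_classes=3):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_feats, audio_feats):
        # each modality queries the other for complementary cues
        t2a, _ = self.text_to_audio(text_feats, audio_feats, audio_feats)
        a2t, _ = self.audio_to_text(audio_feats, text_feats, text_feats)
        fused = torch.cat([t2a, a2t], dim=1)             # (B, Tt+Ta, D)
        # self-attention reinforces high-contributing information
        refined, _ = self.self_attn(fused, fused, fused)
        return self.classifier(refined.mean(dim=1))      # fluency logits

# toy usage with hypothetical text/audio embeddings
text = torch.randn(2, 20, 256)   # e.g., textual features
audio = torch.randn(2, 50, 256)  # e.g., acoustic features
logits = CrossModalFusion()(text, audio)
print(logits.shape)  # torch.Size([2, 3])
```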