Proceedings of the 30th ACM International Conference on Multimedia 2022
DOI: 10.1145/3503161.3551572
Wav2vec2-based Paralinguistic Systems to Recognise Vocalised Emotions and Stuttering

Cited by 16 publications (9 citation statements)
References 12 publications
“…In [48], several data augmentation techniques, such as noise addition, time stretching, pitch shifting, time shifting, and masking, were analysed for dementia detection. There are some studies on data augmentation targeting text-based stuttering detection [13]; however, in the case of audio-based stuttering/disfluency detection, this has not yet been studied and analysed in depth [49].…”
Section: B Data Augmentation
confidence: 99%
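The augmentations named in the statement above (noise, time shifting, masking) can be sketched directly on a raw waveform. This is a minimal illustration with NumPy; the function names and the fixed-SNR noise formulation are my own assumptions, not taken from the cited works.

```python
import numpy as np

def add_noise(signal, snr_db=20.0, rng=None):
    # Additive white Gaussian noise at a target signal-to-noise ratio (dB).
    rng = rng or np.random.default_rng(0)
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def time_shift(signal, shift):
    # Circularly shift the waveform by `shift` samples.
    return np.roll(signal, shift)

def time_mask(signal, start, width):
    # Zero out a span of samples (a waveform analogue of SpecAugment-style masking).
    out = signal.copy()
    out[start:start + width] = 0.0
    return out
```

Pitch shifting and time stretching need resampling or phase-vocoder machinery and are usually delegated to an audio library rather than written by hand.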
“…Therefore, the pre-trained models can serve as powerful feature extractors in detection systems based on the two-stage pipeline architecture [11]. The main advantage of pre-trained models is that they can easily be fine-tuned on small amounts of labeled data to achieve state-of-the-art results on the required task [13]–[16]. When the wav2vec2 model is fine-tuned on a specific task, it can draw on the general characteristics of speech that it learned from the large amount of speech data seen during pre-training.…”
Section: Introduction
confidence: 99%
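The two-stage pipeline described in the statement above amounts to: (1) extract fixed-size utterance embeddings with a frozen pre-trained model, (2) train a lightweight classifier on top. A minimal sketch, using random placeholder embeddings in place of real wav2vec2 hidden states (which would be mean-pooled per utterance); the 768-dimensional size matches the wav2vec2 base model, but everything else here is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stage 1 (stand-in): placeholder embeddings in place of frozen
# wav2vec2 features; a real system would mean-pool hidden states
# over time for each utterance.
X_train = rng.normal(size=(200, 768))
y_train = rng.integers(0, 2, size=200)  # binary labels, e.g. stutter / no stutter

# Stage 2: a lightweight classifier trained on the fixed embeddings.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
preds = clf.predict(rng.normal(size=(10, 768)))
```

The appeal of this design is that only the small second-stage model is trained, so a few labeled examples suffice; full fine-tuning of wav2vec2 itself typically yields stronger results at a higher compute cost.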
“…The organisers of this year's competition presented several solutions as baselines, such as DeepSpectrum [2], AuDeep [1,8], and the ComParE Acoustic Feature Set. Lastly, the popular pre-trained wav2vec2 model [3], which has exhibited remarkable results in various paralinguistic domains [9,11,17,23,25], was also employed as a baseline.…”
Section: Introduction
confidence: 99%