Learning Paralinguistic Features from Audiobooks through Style Voice Conversion

Aldeneh, Zakaria; Perez, Matthew; Provost, Emily Mower

doi:10.18653/v1/2021.naacl-main.377

Cited by 1 publication

(1 citation statement)

References 27 publications

“…Aldeneh et al. [11] proposed a framework for learning to extract paralinguistic embeddings. The authors showed that converting synthetic-neutral speech to expressive speech based on that embedding improved the results from acoustic features and other evaluated embeddings.…”

Section: Related Work: SSL and SER (mentioning)

confidence: 99%

Evaluating Self-Supervised Speech Representations for Speech Emotion Recognition

Atmaja

2022

IEEE Access

Self-supervised learning has recently been adopted widely in speech processing, replacing conventional acoustic feature extraction as a way to extract meaningful information from speech. One of the challenging applications of speech processing is extracting affective information from speech, commonly called speech emotion recognition. Until now, it has been unclear how these speech representations compare with classical acoustic features. This paper evaluates nineteen self-supervised speech representations and one classical acoustic feature on five distinct speech emotion recognition datasets using the same classifier. We calculate the effect size among the twenty speech representations to show the magnitude of the relative differences from the highest to the lowest performance. The top three are WavLM Large, UniSpeech-SAT Large, and HuBERT Large, with negligible effect sizes among them. The significance test supports the difference among self-supervised speech representations. The best prediction for each dataset is shown as a confusion matrix to give insight into how well the speech representations perform for each emotion category, based on training data from balanced vs. unbalanced datasets, English vs. Japanese corpora, and five vs. six emotion categories. Although self-supervised representations prove competitive for speech emotion recognition, this exploration also shows their limitations for models pre-trained on small data and for training on unbalanced datasets.
