Speech Emotion Recognition in Acted and Spontaneous Context

Chenchah, Farah; Lachiri, Zied

doi:10.1016/j.procs.2014.11.020

Cited by 18 publications

(3 citation statements)

References 14 publications

(7 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…As mentioned, the number of emotional corpora with nonverbal expressions is limited. Emotional speech corpora can be roughly categorized into two types: acted and spontaneous [8]. In acted corpora, scripts and a recording instruction are provided to the speakers before the recording, which makes it easy to control label balance and almost does not need extra manual annotation.…”

Section: B: Emotional Speech Corpus With Nonverbal Expressionsmentioning

confidence: 99%

“…To show the contributions of NVs on emotion recognizability, we also evaluate the verbal parts of JVNV by removing nonverbal parts from the utterances of JVNV, which is denoted as ''JVNV-V''. We conducted a forced choice task on a Japanese crowdsourcing platform 8 . For each corpus, we randomly picked up 60 emotion-balanced samples.…”

Section: B Emotion Recognizabilitymentioning

confidence: 99%

See 1 more Smart Citation

JVNV: A Corpus of Japanese Emotional Speech With Verbal Content and Nonverbal Expressions

Xin,

Jiang,

Takamichi

et al. 2024

IEEE Access

View full text Add to dashboard Cite

We present the JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also nonverbal vocalizations (NVs) that are essential expressions in spoken language to express emotions. We propose an automatic script generation method to produce emotional scripts by providing seed words with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT using prompt engineering. We select 514 scripts with balanced phoneme coverage from the generated candidate scripts with the assistance of emotion confidence scores and language fluency scores. We demonstrate the effectiveness of JVNV by showing that JVNV has better phoneme coverage and emotion recognizability than previous Japanese emotional speech corpora. We then benchmark JVNV on emotional text-to-speech synthesis using discrete codes to represent NVs. We show that there still exists a gap between the performance of synthesizing read-aloud speech and emotional speech, and adding NVs in the speech makes the task even harder, which brings new challenges for this task and makes JVNV a valuable resource for relevant works in the future. To our best knowledge, JVNV is the first speech corpus that generates scripts automatically using large language models.

show abstract

Section: B: Emotional Speech Corpus With Nonverbal Expressionsmentioning

confidence: 99%

Section: B Emotion Recognizabilitymentioning

confidence: 99%

JVNV: A Corpus of Japanese Emotional Speech With Verbal Content and Nonverbal Expressions

Xin,

Jiang,

Takamichi

et al. 2024

IEEE Access

View full text Add to dashboard Cite

show abstract

“…Due to these limitations, audio samples in many speechbased emotion datasets are collected through acted elicitation methods relying on individuals who engender a target emotion while uttering pre-determined linguistic contents, also known as scripted speech [6]. Despite the fact that these methods tend to overlook subtle expression details, they provide ample data, based on which machine learning (ML) methodologies can recognize emotions [7].…”

Section: Introductionmentioning

confidence: 99%

Few-shot Learning in Emotion Recognition of Spontaneous Speech Using a Siamese Neural Network with Adaptive Sample Pair Formation

Feng,

Chaspari

2021

Preprint

View full text Add to dashboard Cite

Speech-based machine learning (ML) has been heralded as a promising solution for tracking prosodic and spectrotemporal patterns in real-life that are indicative of emotional changes, providing a valuable window into one's cognitive and mental state. Yet, the scarcity of labelled data in ambulatory studies prevents the reliable training of ML models, which usually rely on "data-hungry" distribution-based learning. Leveraging the abundance of labelled speech data from acted emotions, this paper proposes a few-shot learning approach for automatically recognizing emotion in spontaneous speech from a small number of labelled samples. Few-shot learning is implemented via a metric learning approach through a siamese neural network, which models the relative distance between samples rather than relying on learning absolute patterns of the corresponding distributions of each emotion. Results indicate the feasibility of the proposed metric learning in recognizing emotions from spontaneous speech in four datasets, even with a small amount of labelled samples. They further demonstrate superior performance of the proposed metric learning compared to commonly used adaptation methods, including network fine-tuning and adversarial learning. Findings from this work provide a foundation for the ambulatory tracking of human emotion in spontaneous speech contributing to the real-life assessment of mental health degradation.

show abstract