2021
DOI: 10.1109/taslp.2021.3076364

Speech Emotion Recognition Considering Nonverbal Vocalization in Affective Conversations

Cited by 40 publications (11 citation statements)
References 41 publications

“…Out of the selected studies, 39/51 (76.5%) utilized English language datasets with IEMOCAP being used in 30/39 of those cases. 6 of the remaining studies employed 6 databases in Chinese [28][29][30][31][32][33], 2 databases each in French [34,35] and Dutch [36,37], and 1 data set each for Bengali [38], Greek [39], Malay [40], Indonesian [41], and Hungarian [42]. Two studies did not specify the spoken language of the used datasets [43,44].…”
Section: Characteristics of the Included Studies
Confidence: 99%

“…OpenSMILE-based features [32,50,53]); (b) deep-learned features extracted from the raw waveform or image by means of DL (e.g. ResNet18 [32]) or pre-trained, transfer-learned feature extractors (e.g. Wav2vec [54]), here accounting for 25.5% (13/51) of total studies; (c) image transformations, summing to 19.6% (10/51), as yielded by advanced signal-processing methods applied to raw waveforms, such as spectrograms [48,55] or Mel-Frequency Cepstral Coefficients (MFCCs) [47,56]; (d) hybrid approaches, as combinations of two or three of the aforementioned options, here appearing in 25.5% (13/51) of study items.…”
Section: Characteristics of the Included Studies
Confidence: 99%

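The feature families named in that excerpt can be illustrated with a short, hedged sketch. The snippet below is not taken from any of the cited studies; it only shows, under assumed parameter values and an assumed input file name, how a log-mel spectrogram (one of the "image transformations") and MFCCs might be extracted with librosa.

```python
# Illustrative sketch only: the file name, sampling rate, and frame settings
# are assumptions, not values reported by the cited studies.
import librosa

wav, sr = librosa.load("utterance.wav", sr=16000)  # assumed 16 kHz mono recording

# (c)-style image transformation: log-mel spectrogram of the raw waveform
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=400, hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel)                 # shape: (64, n_frames)

# MFCCs computed from the same signal
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)  # shape: (13, n_frames)

print(log_mel.shape, mfcc.shape)
```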
“…The DB consists of both discrete and continuous emotion annotations carried out by 49 annotators consisting of students and professors. The DB has been used for emotion recognition using only speech by considering the nonverbal vocalization in conversations depicting emotions [122]. They used only the audio portion of the DB and used an LSTM to acquire the shifts in the dialogue of the speaker's emotion from a sequence of segmented speech signals.…”
Section: NNIME
Confidence: 99%

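As a rough illustration of the idea described above, an LSTM operating on a sequence of segmented speech features to follow emotion shifts across a dialogue, here is a minimal PyTorch sketch. It is not the authors' model; the feature dimension, hidden size, and number of emotion classes are illustrative assumptions.

```python
# Minimal sketch, assuming segment-level acoustic feature vectors are already
# available for each speech segment in a conversation. Dimensions are made up.
import torch
import torch.nn as nn

class DialogueEmotionLSTM(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=256, num_emotions=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, segment_feats):
        # segment_feats: (batch, num_segments, feat_dim)
        outputs, _ = self.lstm(segment_feats)   # (batch, num_segments, hidden_dim)
        return self.classifier(outputs)         # per-segment emotion logits

model = DialogueEmotionLSTM()
dummy = torch.randn(2, 10, 128)                 # 2 conversations, 10 segments each
logits = model(dummy)                           # shape: (2, 10, 4)
```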
“…A multi-classifier emotion recognition model based on prosodic information and semantic labels is introduced in [5]. Similarly, the semantic labels and the non-verbal audio in speech, such as onomatopoeia like crying, laughter, or sighing, are used in SER [6]. Subsequently, temporal and semantic coherence is introduced for SER [7].…”
Section: Introduction
Confidence: 99%