2008 IEEE International Conference on Multimedia and Expo
DOI: 10.1109/icme.2008.4607689

Combining speech recognition and acoustic word emotion models for robust text-independent emotion recognition

Abstract: Recognition of emotion in speech usually uses acoustic models that ignore the spoken content. Likewise, one general model per emotion is trained, independent of the phonetic structure. Given sufficient data, this approach seemingly works well enough. Yet, this paper tries to answer the question of whether acoustic emotion recognition strongly depends on phonetic content, and whether models tailored to the spoken unit can lead to higher accuracies. We therefore investigate phoneme- and word-level models by use of a large pro…
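To make the word-level idea concrete, the following is a minimal, hypothetical Python sketch (not code from the paper): word-specific acoustic emotion models with a fallback to a single general model when a word was not seen in training, assuming an ASR front end has already segmented the utterance into words. The GaussianMixture model type, the two-emotion label set, the feature dimensionality, and the toy data are all assumptions made for illustration.

# Illustrative sketch (not the authors' implementation): word-level acoustic
# emotion models with a fallback to one general model per emotion.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
EMOTIONS = ["neutral", "anger"]   # assumed label set
N_FEATS = 12                      # assumed per-word acoustic feature dimension

def train_models(samples):
    """samples: list of (word, emotion, feature_vector) tuples."""
    models = {}                   # (word or None, emotion) -> fitted GMM
    by_key = {}
    for word, emo, x in samples:
        by_key.setdefault((word, emo), []).append(x)
        by_key.setdefault((None, emo), []).append(x)   # pool for the general model
    for key, xs in by_key.items():
        gmm = GaussianMixture(n_components=1, covariance_type="diag")
        models[key] = gmm.fit(np.vstack(xs))
    return models

def classify_word(models, word, x):
    """Score with the word-specific models if available, else the general ones."""
    scores = {}
    for emo in EMOTIONS:
        key = (word, emo) if (word, emo) in models else (None, emo)
        scores[emo] = models[key].score(x.reshape(1, -1))
    return max(scores, key=scores.get)

# Toy usage with random "features"; a real system would use prosodic/spectral descriptors.
train = [("hello", e, rng.normal(i, 1.0, N_FEATS))
         for i, e in enumerate(EMOTIONS) for _ in range(20)]
models = train_models(train)
print(classify_word(models, "hello", rng.normal(1.0, 1.0, N_FEATS)))
print(classify_word(models, "unseen", rng.normal(0.0, 1.0, N_FEATS)))  # falls back to the general model

The fallback to the pooled general model is what keeps such a scheme usable on vocabulary never seen in training, which is the robustness question the paper's title raises; how the authors actually combine ASR output with the emotion models is described in the full text.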

Cited by 21 publications (5 citation statements)
References 4 publications
“…Similar observations have been reported by Schuller et al [14]. In their experiments, they trained multiple emotion classification models on phoneme and word level segments.…”
Section: Related Work (supporting)
confidence: 82%
“…A possibility to use static classifiers for frame-level feature processing is further given by multi-instance learning techniques, where a time series of unknown length is handled as one by SVM or similar techniques [206,187]. Still, when the spoken content is fixed, the combination of static and dynamic processing may help improve overall accuracy [224,195].…”
Section: Classification (mentioning)
confidence: 99%
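As a rough illustration of the static-over-dynamic idea mentioned in that excerpt (a hedged sketch, not code from the cited works): a variable-length sequence of frame-level low-level descriptors is mapped to one fixed-length vector of statistical functionals so that a static SVM can classify it. The chosen functionals, feature dimensionality, and toy data are assumptions.

# Sketch: summarize a frame-level time series into fixed-length functionals for a static SVM.
import numpy as np
from sklearn.svm import SVC

def functionals(frames):
    """frames: (n_frames, n_low_level_descriptors) -> one static feature vector."""
    return np.concatenate([frames.mean(axis=0),
                           frames.std(axis=0),
                           frames.min(axis=0),
                           frames.max(axis=0)])

rng = np.random.default_rng(1)
# Toy corpus: sequences of differing length, two classes shifted in mean.
X = [functionals(rng.normal(label, 1.0, (rng.integers(50, 200), 10)))
     for label in (0, 1) for _ in range(30)]
y = [label for label in (0, 1) for _ in range(30)]

clf = SVC(kernel="rbf").fit(np.vstack(X), y)
test = functionals(rng.normal(1.0, 1.0, (120, 10)))
print(clf.predict(test.reshape(1, -1)))

The same fixed-length vector can also be fed alongside a dynamic (e.g. HMM-based) decision in a late-fusion setup, which is one reading of "combination of static and dynamic processing" in the excerpt.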
“…The mean value of overlap has been fixed to 15% of the speech frames for the overall dataset. For each sentence the amount of overlap is obtained as a random value drawn from the uniform distribution on the interval [12,18]. This assumption allows the artificial database to reflect the frequency of overlapped speech in real-life scenarios such as two-party telephone conversations or meetings (Shriberg et al (2000)).…”
Section: Corpus Description (mentioning)
confidence: 99%
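The sampling step described in that excerpt can be reproduced directly; the sketch below (the corpus size is an arbitrary assumption) draws a per-sentence overlap percentage uniformly from [12, 18], whose expected value is the stated 15%.

# Draw per-sentence overlap percentages uniformly from [12, 18]; mean is 15% by construction.
import numpy as np

rng = np.random.default_rng(42)
n_sentences = 1000                                # assumed number of sentences
overlap_pct = rng.uniform(12.0, 18.0, n_sentences)
print(overlap_pct.mean())                         # approx. 15 (% of speech frames)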