Interspeech 2018
DOI: 10.21437/interspeech.2018-1298

Learning Conditional Acoustic Latent Representation with Gender and Age Attributes for Automatic Pain Level Recognition

Abstract: Pain is an unpleasant internal sensation, caused by bodily damage or physical illness, whose expression varies with personal attributes. In this work, we propose an age- and gender-embedded latent acoustic representation learned using a conditional maximum mean discrepancy variational autoencoder (MMD-CVAE). The learned MMD-CVAE embeds personal attribute information directly in the latent space. Our method achieves 70.7% in extreme-set classification (severe versus mild) and 47.7% in three-class recognition…
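The abstract names the core mechanism: a conditional VAE whose latent space is regularized with maximum mean discrepancy (MMD) and conditioned on age/gender attributes. The following is a minimal PyTorch sketch of that general idea, not the authors' implementation; the feature dimension, layer sizes, attribute encoding, and MMD weight are all illustrative assumptions. The MMD term pulls the aggregate of encoded codes toward the N(0, I) prior, so a deterministic encoder is used here as a common simplification of the stochastic VAE encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between two batches of latent samples."""
    sq_dist = torch.cdist(x, y) ** 2
    return torch.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd(x, y, sigma=1.0):
    """Estimate of squared MMD between samples x ~ q(z) and y ~ p(z)."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean())

class MMDCVAE(nn.Module):
    def __init__(self, feat_dim=88, attr_dim=3, latent_dim=16, hidden=64):
        super().__init__()
        # Encoder sees acoustic features concatenated with the age/gender
        # attribute vector, so the attributes shape the latent code directly.
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim + attr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim))
        # Decoder reconstructs the features from the latent code plus the
        # same attributes (standard conditional-VAE wiring).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + attr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim))

    def forward(self, x, attrs):
        z = self.encoder(torch.cat([x, attrs], dim=-1))
        recon = self.decoder(torch.cat([z, attrs], dim=-1))
        return recon, z

def mmd_cvae_loss(model, x, attrs, mmd_weight=10.0):
    recon, z = model(x, attrs)
    prior_samples = torch.randn_like(z)  # draws from the N(0, I) prior
    return F.mse_loss(recon, x) + mmd_weight * mmd(z, prior_samples)

# Toy usage with random data (batch of 32 utterance-level feature vectors).
model = MMDCVAE()
x = torch.randn(32, 88)     # placeholder acoustic feature vectors
attrs = torch.randn(32, 3)  # placeholder encoded age/gender attributes
loss = mmd_cvae_loss(model, x, attrs)
loss.backward()
```

After training, the encoded latent codes z would feed a downstream pain-level classifier, which is roughly the role the learned representation plays in the paper.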

Cited by 8 publications (17 citation statements)
References 26 publications
“…Thiam et al [60], [61] analyzed the audio signals of the SenseEmotion database, which do not contain verbal interaction, but mostly breathing noises and sporadic moaning sounds. In contrast, Tsai et al [97], [98] and Li et al [84] analyzed audio signals recorded during clinical interviews in an emergency triage situation. Whereas audio outperformed video-based facial expression recognition in Tsai et al [97], the opposite result was found by Thiam et al [60]…”
Section: Audio Approaches (mentioning)
confidence: 99%
“…In the audio domain the most widely used features are Mel Frequency Cepstral Coefficients (MFCC) [60], [61], [69], [70], [91], [97], [98], a spectral representation of sound that approximates the human auditory system's response. Other features include pitch [68], [84], [91], [97], [98], intensity [84], [91], [97], [98], Relative Spectral Perceptual Linear Predictive (RASTA-PLP) coefficients [60], [61], [91], Linear Predictive Coding (LPC) coefficients [60], [70], [91], harmonic to noise ratio [98], and formants [68]. It is common to include the first and second order temporal derivatives of features [60], [61], [91], [97], [98].…”
Section: Audio Features (mentioning)
confidence: 99%
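The statement above enumerates the audio features most common in this literature. As a rough sketch (not the cited papers' pipelines), the following extracts MFCCs with first- and second-order deltas, pitch, and frame-level intensity using librosa; the filename, sampling rate, and coefficient counts are assumptions. RASTA-PLP coefficients, harmonic-to-noise ratio, and formants typically come from other toolkits (e.g., Praat via parselmouth), and LPC coefficients can be computed per frame with librosa.lpc.

```python
import librosa
import numpy as np

# Load an utterance; the filename and 16 kHz rate are placeholders.
y, sr = librosa.load("interview_clip.wav", sr=16000)

# 13 MFCCs plus first- and second-order temporal derivatives (deltas),
# mirroring the common practice noted in the quoted statement.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
d1 = librosa.feature.delta(mfcc, order=1)
d2 = librosa.feature.delta(mfcc, order=2)

# Fundamental frequency (pitch) track via the probabilistic YIN algorithm;
# f0 is NaN on unvoiced frames, so it is kept separate from the stack below.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"))

# Frame-level intensity approximated as RMS energy.
rms = librosa.feature.rms(y=y)

# Stack the frame-synchronous features into one matrix (rows = features);
# default librosa framing (2048-sample windows, 512-sample hop) keeps the
# frame counts aligned.
features = np.vstack([mfcc, d1, d2, rms])
print(features.shape)  # (13*3 + 1, n_frames)
```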
“…Further, voice cues may reveal a speaker's smoking habit: A linear relationship has been observed between the number of cigarettes smoked per day and certain voice features, allowing for speech-based smoker detection in a relatively early stage of the habit (<10 years) [30]. Recorded human sounds can also be used for the automatic recognition of physical pain levels [61] and the detection of sleep disorders like obstructive sleep apnea [19].…”
Section: Speaker Pathology (mentioning)
confidence: 99%
“…These studies tend to focus more on the prosodic and spectral properties of speech. Furthermore, except for a recent work done by Li et al that integrated gender and age attributes as auxiliary information to improve the vocal-based pain-level recognition [14], little if any work has studied exactly how various clinical attributes interact with acoustic manifestation across different pain-levels.…”
Section: Introduction (mentioning)
confidence: 99%