2013
DOI: 10.1109/tetc.2013.2274797

Gender-Driven Emotion Recognition Through Speech Signals For Ambient Intelligence Applications

Abstract: This paper proposes a system that recognizes a person's emotional state from audio signal recordings. The proposed solution aims to improve interaction between humans and computers, enabling effective human-computer intelligent interaction. The system recognizes six emotions (anger, boredom, disgust, fear, happiness, and sadness) and the neutral state. This set of emotional states is widely used for emotion recognition purposes. It also distinguishes a single emotion …

Cited by 81 publications (40 citation statements).
References 33 publications (49 reference statements).
“…Other works, such as [5], employ the RelAtive SpecTral Amplitude (RASTA) coefficients and the Perceptual Linear Prediction (PLP) coefficients. As reported in [2], an increase in classification performance is usually expected when more features are used. However, as the number of features grows, the smartphone needs more computation and energy to compute the feature set.…”
Section: A. Front-End and Feature Extraction
confidence: 73%
“…The human voice y(t) is acquired (with a sampling frequency F_s = 8 kHz) by the smartphone's microphone and is filtered with a second-order Butterworth band-pass filter with bandwidth B ∈ [50, 500] Hz. Since the speech signal is not long-term stationary, it is common to divide it into short segments called frames, over which the signal can be considered stationary [2]. In our case, each frame has a length of T = 40 ms, and consecutive frames overlap by one third of their duration.…”
Section: The Proposed Spectra Application
confidence: 99%
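The front-end described in this citation statement (second-order Butterworth band-pass at 50–500 Hz, then 40 ms frames with one-third overlap) can be sketched as follows. This is a minimal illustration, not the cited authors' code; the function names and the dummy signal are ours.

```python
# Sketch of the described front-end: second-order Butterworth band-pass
# filtering (50-500 Hz at Fs = 8 kHz), then framing into 40 ms frames
# overlapped by one third of their duration.
import numpy as np
from scipy.signal import butter, sosfilt

FS = 8000                          # sampling frequency [Hz]
FRAME_LEN = int(0.040 * FS)        # 40 ms -> 320 samples
HOP = FRAME_LEN - FRAME_LEN // 3   # 1/3 overlap -> hop of 214 samples

def bandpass(y, fs=FS, low=50.0, high=500.0, order=2):
    """Second-order Butterworth band-pass filter over [low, high] Hz."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, y)

def frame(y, frame_len=FRAME_LEN, hop=HOP):
    """Split y into overlapping frames, shape (num_frames, frame_len)."""
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return y[idx]

if __name__ == "__main__":
    y = np.random.randn(FS)        # 1 s of dummy audio
    frames = frame(bandpass(y))
    print(frames.shape)            # one second yields 36 frames of 320 samples
```

Each frame can then be passed to the feature-extraction stage (e.g. MFCC, RASTA, or PLP coefficients) mentioned in the statements above.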
“…The speaker's gender plays a significant role, since the two genders have significantly different vocal features and may express their emotions differently. In an attempt to investigate such issues, Bisio et al. [19] demonstrated that a-priori knowledge of gender can lead to a significant increase in performance, so they proposed a system whose initial step was to classify the speaker's gender based on spectral features of her/his voice. Typical recognition schemes work with utterances.…”
Section: Related Work
confidence: 99%
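The gender-driven scheme attributed to Bisio et al. [19] above can be sketched as a two-stage classifier: predict the speaker's gender first, then route the utterance to a gender-specific emotion classifier. This is an illustrative sketch only, not the authors' implementation; the random placeholder features and SVM choice are our assumptions.

```python
# Two-stage (gender-first) emotion recognition sketch.
# Stage 1 classifies gender; stage 2 uses a gender-specific emotion model.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))          # placeholder per-utterance features
gender = rng.integers(0, 2, size=200)   # 0 = female, 1 = male
emotion = rng.integers(0, 7, size=200)  # six emotions + neutral state

# Stage 1: gender classifier trained on all utterances.
gender_clf = SVC().fit(X, gender)

# Stage 2: one emotion classifier per gender.
emotion_clf = {g: SVC().fit(X[gender == g], emotion[gender == g])
               for g in (0, 1)}

def predict_emotion(x):
    """Predict emotion by first predicting gender, then dispatching."""
    g = gender_clf.predict(x.reshape(1, -1))[0]
    return emotion_clf[g].predict(x.reshape(1, -1))[0]
```

With real spectral features, the stage-2 models specialize to each gender's vocal characteristics, which is the source of the performance gain the citation statement describes.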
“…To recognize emotions in speech recorded from an automated answering service, Laurence Vidrascu [5] used support vector machines (SVM) and logistic model trees (LMT: Logistic Model Tree). Kalyana Kumar Inakollu [11] used multi-component Gaussian mixture models (GMM: Gaussian Mixture Model), with speech modeled by Mel Frequency Cepstral Coefficients (MFCC) [12]. Thurid [16] used gender information to improve the performance of an emotion recognition system.…”
unclassified