2016 IEEE International Conference on Multimedia and Expo (ICME)
DOI: 10.1109/icme.2016.7552890
Inferring users' emotions for human-mobile voice dialogue applications

Cited by 11 publications (5 citation statements)
References 15 publications
“…In [22], a CNN- and Gated Recurrent Unit (GRU)-based neural network was proposed for speaker identification and verification. Additionally, a hybrid emotion inference model using an LSTM was proposed for inferring user emotion in a real-world voice-dialogue application, with a recurrent autoencoder used to pre-train the LSTM to improve accuracy [32]. Further, GMMs and DNNs were combined to identify distant accents in reverberant environments [26].…”
Section: Related Work (mentioning)
confidence: 99%
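A minimal sketch of the pre-training idea summarized above, assuming PyTorch and made-up feature dimensions (the original paper's architecture details are not reproduced here): a recurrent autoencoder learns to reconstruct acoustic feature sequences from unlabeled speech, and its encoder LSTM then initializes an emotion classifier that is fine-tuned on labeled data.

```python
import torch
import torch.nn as nn

class RecurrentAutoencoder(nn.Module):
    """Seq2seq autoencoder: an encoder LSTM compresses a feature
    sequence into its final state; a decoder LSTM reconstructs it."""
    def __init__(self, feat_dim=40, hidden_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):
        _, (h, c) = self.encoder(x)          # summarize the sequence
        dec_in = torch.zeros_like(x)         # decode from encoder state
        dec_out, _ = self.decoder(dec_in, (h, c))
        return self.out(dec_out)             # reconstructed features

class EmotionClassifier(nn.Module):
    """LSTM classifier whose recurrent weights are initialized
    from the pre-trained autoencoder's encoder."""
    def __init__(self, autoencoder, n_emotions=4):
        super().__init__()
        self.lstm = autoencoder.encoder      # reuse pre-trained LSTM
        self.fc = nn.Linear(128, n_emotions)

    def forward(self, x):
        _, (h, _) = self.lstm(x)
        return self.fc(h[-1])                # emotion logits

# Unsupervised pre-training on (here: random stand-in) features,
# then supervised fine-tuning on labeled emotion data.
ae = RecurrentAutoencoder()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
feats = torch.randn(8, 100, 40)              # (batch, frames, features)
loss = nn.functional.mse_loss(ae(feats), feats)
loss.backward(); opt.step()

clf = EmotionClassifier(ae)
logits = clf(feats)                          # shape: (8, 4)
```

The design choice the sketch illustrates is that reconstruction pre-training lets the encoder learn sequence structure from unlabeled audio before the (typically scarce) emotion labels are used.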
“…Speech emotion recognition is the process of identifying a speaker's emotion from his or her speech [5,6]. Its main task is to analyze human expressions in multiple modalities such as text, speech, or video and recognize the underlying emotions [7]. It is often used in customer-service scenarios to evaluate agents' quality of service (QoS).…”
Section: Basic Speech Technology (mentioning)
confidence: 99%
“…[18] performed sentiment analysis on audio data by first transcribing the spoken words and then running sentiment analysis on the transcript. Related to audio-based sentiment analysis is the task of estimating the emotional state of the speaker from audio input [19]. For the visual modality, the Facial Action Coding System [20] laid the groundwork for analyzing facial expressions and emotions.…”
Section: Related Work (mentioning)
confidence: 99%
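The transcribe-then-analyze pipeline attributed to [18] above can be sketched as follows. This is an illustration only, assuming the Hugging Face transformers library, the model names shown, and a hypothetical input file; the tooling actually used in [18] is not specified here.

```python
from transformers import pipeline

# Step 1: transcribe the spoken words (ASR model is an assumption,
# not the one used in [18]).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
transcript = asr("call_recording.wav")["text"]  # hypothetical audio file

# Step 2: run text sentiment analysis on the transcript.
sentiment = pipeline("sentiment-analysis")
result = sentiment(transcript)[0]

print(result["label"], result["score"])  # e.g. POSITIVE 0.98
```

Note that this text-only route discards prosodic cues (pitch, energy, tempo), which is exactly what the audio-based emotion estimation of [19] and the paper under review aim to exploit.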