Attention Based Fully Convolutional Network for Speech Emotion Recognition

Zhang, Yuanyuan; Du, Jun; Wang, Zi-Rui; Zhang, Jianshu; Tu, Yan-Hui

doi:10.23919/apsipa.2018.8659587

Cited by 118 publications

(93 citation statements)

References 25 publications


“…In [17], we proposed a novel attention based fully convolutional neural network for audio emotion recognition. The proposed attention mechanism helps the model focus on the emotion-relevant regions in speech spectrogram.…”

Section: The Proposed Architecture (mentioning)

confidence: 99%

“…The typical CNNs, including AlexNet [23], VGGNet [24], and ResNet [25] take a fixed-size input due to the limitation of fully connected layers. Considering the loss of information caused by the fixed-size input, we proposed a fully convolutional network to handle variable-length speech in [17]. In this study, the same is used as audio encoder, which is shown in Fig.…”

Section: A. Audio Stream (mentioning)

confidence: 99%

“…The effectiveness of audio stream for emotion recognition has been proven in [17]. In order to verify that the attention mechanism is also effective for visual sequences, we first design a system in which only video is used, i.e., we simply remove the audio stream and the FBP block in Fig.…”

Section: B. Video System (mentioning)

confidence: 99%

“…For the audio stream, the process in [17] is applied to extract the audio feature from raw waveform. First, a sequence of overlapping Hamming windows are applied to the speech waveform, with window shift set to 10 msec, and window size set to 40 msec.…”

Section: Audio-video Fusion System (mentioning)

confidence: 99%
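
The framing step quoted above (overlapping Hamming windows, 10 msec shift, 40 msec window) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `frame_signal` and the 16 kHz sample rate in the usage note are assumptions that do not appear in the quoted text.

```python
import numpy as np

def frame_signal(waveform, sample_rate, win_ms=40.0, shift_ms=10.0):
    """Split a 1-D waveform into overlapping Hamming-windowed frames.

    win_ms and shift_ms default to the 40 msec / 10 msec values in the
    quoted statement; everything else here is a hypothetical sketch.
    """
    waveform = np.asarray(waveform, dtype=float)
    win_len = int(round(sample_rate * win_ms / 1000.0))  # samples per window
    hop = int(round(sample_rate * shift_ms / 1000.0))    # samples per shift
    if len(waveform) < win_len:
        # Too short for even one full window: return an empty frame matrix.
        return np.empty((0, win_len))
    n_frames = 1 + (len(waveform) - win_len) // hop
    # Row i holds the sample indices of frame i.
    idx = np.arange(win_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return waveform[idx] * np.hamming(win_len)
```

Under the assumed 16 kHz sample rate, each window covers 640 samples and advances by 160 samples, so consecutive frames overlap by 75%. The number of frames grows with the utterance length, which is why a downstream encoder has to accept variable-length input.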

“…With the emerging deep learning, the state-of-the-art classifiers are always CNNs or LSTMs [14], [15], [16]. In [17], we proposed a novel attention based fully convolutional network for this task.…”

Section: Introduction (mentioning)

confidence: 99%


Deep Fusion: An Attention Guided Factorized Bilinear Pooling for Audio-video Emotion Recognition

Zhang

Wang

2019

2019 International Joint Conference on Neural Networks (IJCNN)

Self Cite


Automatic emotion recognition (AER) is a challenging task due to the abstract concept and multiple expressions of emotion. Although there is no consensus on a definition, human emotional states usually can be apperceived by auditory and visual systems. Inspired by this cognitive process in human beings, it's natural to simultaneously utilize audio and visual information in AER. However, most traditional fusion approaches only build a linear paradigm, such as feature concatenation and multi-system fusion, which hardly captures complex association between audio and video. In this paper, we introduce factorized bilinear pooling (FBP) to deeply integrate the features of audio and video. Specifically, the features are selected through the embedded attention mechanism from respective modalities to obtain the emotion-related regions. The whole pipeline can be completed in a neural network. Validated on the AFEW database of the audio-video sub-challenge in EmotiW2018, the proposed approach achieves an accuracy of 62.48%, outperforming the state-of-the-art result.


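
The abstract of the citing paper names factorized bilinear pooling (FBP) but does not spell out the computation. The sketch below shows one common formulation of factorized bilinear pooling: project each modality, multiply element-wise, then sum-pool over groups of `k` factors. It is an assumption that the paper follows this form; the function name, the matrices `U` and `V`, and all dimensions are hypothetical.

```python
import numpy as np

def factorized_bilinear_pool(x, y, U, V, k):
    """Fuse two feature vectors with a low-rank bilinear interaction.

    x: (dx,) audio feature, y: (dy,) video feature (hypothetical shapes).
    U: (dx, o*k) and V: (dy, o*k) projection matrices, where o is the
    output size and k is the number of factors per output unit.
    """
    joint = (x @ U) * (y @ V)                # element-wise product, shape (o*k,)
    return joint.reshape(-1, k).sum(axis=1)  # sum-pool each group of k, shape (o,)

# Tiny usage example with made-up sizes: dx=3, dy=2, o=2, k=2.
x, y = np.ones(3), np.ones(2)
U, V = np.ones((3, 4)), np.ones((2, 4))
z = factorized_bilinear_pool(x, y, U, V, k=2)
```

Unlike feature concatenation, every output unit here depends on products of audio and video components, which is the kind of non-linear association the abstract says linear fusion fails to capture.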