2017 IEEE International Conference on Multimedia and Expo (ICME) 2017
DOI: 10.1109/icme.2017.8019296
Deep convolutional recurrent neural network with attention mechanism for robust speech emotion recognition

Abstract: Deep convolutional neural networks are being actively investigated in a wide range of speech and audio processing applications, including speech recognition, audio event detection, and computational paralinguistics, owing to their ability to reduce factors of variation when learning from speech. However, studies have suggested favoring a certain type of convolutional operation when building a deep convolutional neural network for speech applications, although there have been promising results using different typ…

Cited by 117 publications (81 citation statements). References 46 publications.
“…The sequence of acoustic spectral features is first input to multiple 1D CNN layers. The CNN kernel filters shift along the temporal axis and include the entire spectrum information per scan, which is proven to have better performance than other kernel structure settings by [21]. CNN filters with different weights are utilized to extract different information from same input features and followed by recurrent layers to capture context and dynamic information within each speech segment.…”
Section: Model Structure
confidence: 99%
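The kernel arrangement described above can be illustrated with a minimal NumPy sketch (all names and shapes are hypothetical, for illustration only): each filter covers the entire spectral axis of the input, so it slides only along the temporal axis, producing one feature map per filter.

```python
import numpy as np

def temporal_conv1d(spec, kernels):
    """1D convolution along time only.

    Each kernel spans the full spectral axis, so filters shift
    along the temporal axis and see the entire spectrum per scan.
    spec:    (n_freq, n_frames) spectrogram-like input
    kernels: (n_filters, n_freq, width) filter bank
    returns: (n_filters, n_frames - width + 1) feature maps
    """
    n_filters, n_freq, width = kernels.shape
    n_out = spec.shape[1] - width + 1
    out = np.empty((n_filters, n_out))
    for f in range(n_filters):
        for t in range(n_out):
            # full-spectrum patch at time t, correlated with kernel f
            out[f, t] = np.sum(kernels[f] * spec[:, t:t + width])
    return out

# toy example: 40 mel bins, 100 frames, 8 filters of temporal width 5
spec = np.random.randn(40, 100)
kernels = np.random.randn(8, 40, 5)
feats = temporal_conv1d(spec, kernels)
print(feats.shape)  # (8, 96)
```

In the cited architecture, feature maps like these would then be fed to recurrent layers to capture context within each speech segment; that stage is omitted here.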
“…The CLDNNs model is trained on the log-Mel filter bank energies [14] and on the raw waveform speech signal [15] for speech recognition, and the results showed that both CLDNN models outperform CNN and LSTM alone or combined. Similarly, in [16] and [17] CLDNN-based speech emotion recognition experiments are conducted on log-Mels and spectrograms respectively. In [18], a network architecture of convolutional recurrent neural network (CRNN) is proposed for large vocabulary speech recognition by combining the CNN and LSTM-RNN.…”
Section: Related Work
confidence: 99%
“…In this paper, we selected the noise signals assuming the environments in which a speech emotion recognition system is likely to be used. Some of these noise conditions are those of conventional studies (e.g., the car) [15,17]. The first three noise signals were mixed with the training and the validation sets, and the other two noise signals were mixed with the test set.…”
Section: Noise Dataset
confidence: 99%
“…Such techniques seem to be effective for enhancing noise tolerance even in speech emotion recognition. Some studies introduced multi-condition training to speech emotion recognition [15][16][17]. For example, Heracleous et al [15] introduced multi-condition training into the i-vector and PLDA approach for recognizing the emotion of speech.…”
Section: Introduction
confidence: 99%
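The noise-mixing step underlying multi-condition training can be sketched as follows (a minimal NumPy example; function name, signal lengths, and SNR value are illustrative assumptions, not from the cited papers): the noise is fit to the clean signal's length and scaled so the mixture hits a target signal-to-noise ratio.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix noise into clean speech at a target SNR in dB.

    The noise is looped/truncated to the clean signal's length,
    then scaled so that 10*log10(P_clean / P_noise) == snr_db.
    """
    noise = np.resize(noise, clean.shape)           # repeat/truncate to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # 1 s of "speech" at 16 kHz
noise = rng.standard_normal(8000)    # shorter noise clip, will be looped
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

In a multi-condition setup, mixtures like `noisy` would be generated from several noise types for the training and validation sets, with held-out noise types reserved for the test set, as the quoted study describes.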