End-to-end speech emotion recognition using multi-scale convolution networks

Sivanagaraja, Tatinati; Ho, Mun Kit; Khong, Andy W. H.; Wang, Yübo

doi:10.1109/apsipa.2017.8282026

Cited by 8 publications

(3 citation statements)

References 14 publications

“…Table 3 presents the performance of the proposed method in comparison with other state-of-the-art proposals for the SAVEE database. Sivanagaraja et al [36] propose a multiscale convolution network (MCNN) for SER using rawWav to train a DNN, which consists of three stages: (i) the signal transformation stage, (ii) the local convolution stage, and (iii) the global convolution stage. Latif et al [37] introduce a deep belief Network (DBN) with three RBM layers using the eGeMAPS features set.…”

Section: Discussion (mentioning)

confidence: 99%
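
The three stages named in the citation statement above (signal transformation, local convolution, global convolution) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the cited paper: the down-sampling factors, kernel sizes, pooling width, random (untrained) kernels, and the helper names `multi_scale_views`, `conv1d_relu`, `max_pool`, and `mcnn_features` are all hypothetical.

```python
import numpy as np

def multi_scale_views(wave, factors=(1, 2, 4)):
    # Stage (i), signal transformation. Assumption: plain down-sampling by
    # striding; the cited paper may use different transformations.
    return [wave[::k] for k in factors]

def conv1d_relu(x, kernel):
    # "Valid" 1-D convolution followed by a ReLU non-linearity.
    return np.maximum(np.convolve(x, kernel, mode="valid"), 0.0)

def max_pool(x, size):
    # Non-overlapping max pooling; any trailing remainder is dropped.
    n = (len(x) // size) * size
    return x[:n].reshape(-1, size).max(axis=1)

def mcnn_features(wave, local_kernel, global_kernel, pool=4):
    # Stage (ii), local convolution: each scaled view is filtered and pooled
    # independently of the others.
    branches = [max_pool(conv1d_relu(view, local_kernel), pool)
                for view in multi_scale_views(wave)]
    # Stage (iii), global convolution: the branch outputs are concatenated
    # and filtered once more across all scales.
    merged = np.concatenate(branches)
    return conv1d_relu(merged, global_kernel)

# Hypothetical input: 1600 samples of noise standing in for a raw waveform.
rng = np.random.default_rng(0)
wave = rng.standard_normal(1600)
feats = mcnn_features(wave, rng.standard_normal(9), rng.standard_normal(5))
```

In a trained network the kernels would be learned and the resulting features passed to a classifier over emotion labels; the sketch only shows how data flows through the three stages.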

Speech Emotion Recognition Using Convolutional Neural Networks with Attention Mechanism

Mountzouris, Perikos, Hatzilygeroudis

2023

Electronics

Speech emotion recognition (SER) is an interesting and difficult problem to handle. In this paper, we deal with it through the implementation of deep learning networks. We have designed and implemented six different deep learning networks, a deep belief network (DBN), a simple deep neural network (SDNN), an LSTM network (LSTM), an LSTM network with the addition of an attention mechanism (LSTM-ATN), a convolutional neural network (CNN), and a convolutional neural network with the addition of an attention mechanism (CNN-ATN), having in mind, apart from solving the SER problem, to test the impact of the attention mechanism on the results. Dropout and batch normalization techniques are also used to improve the generalization ability (prevention of overfitting) of the models as well as to speed up the training process. The Surrey Audio–Visual Expressed Emotion (SAVEE) database and the Ryerson Audio–Visual Database (RAVDESS) were used for the training and evaluation of our models. The results showed that the networks with the addition of the attention mechanism did better than the others. Furthermore, they showed that the CNN-ATN was the best among the tested networks, achieving an accuracy of 74% for the SAVEE database and 77% for the RAVDESS, and exceeding existing state-of-the-art systems for the same datasets.

“…
Model | Reference | Input features | SAVEE test accuracy (%)
MCNN | Sivanagaraja [36] | rawWav | 50.28
DBN | Latif [37] | eGeMAPS | 56.76
DNN | Fayek, Lech and Cavedon [38] | Spectrogram | 59.7
HMM | Chenchah and Lachiri [39] | LFCCs/MFCCs | 45/61.25
Proposed method | | MFCCs | 74
…”

Section: Models, Input Features, SAVEE Test Accuracy (%) (mentioning)

confidence: 99%

“…ASR systems have been used not only to convert speech into text but also in other kinds of studies. For example, there are studies on identifying emotion in speech [50], [51], on speech analysis that automatically identifies negative affect and aggression [52], and on gender recognition [53]. In addition, accent recognition studies not only play an important role in improving the performance of ASR systems but also provide detailed information about the speaker [54].…”

Section: Literatür Taraması (Literature Review) (unclassified)

Otomatik Konuşma Tanımaya Genel Bakış, Yaklaşımlar ve Zorluklar: Türkçe Konuşma Tanımanın Gelecekteki Yolu

Oyucu, Sever, Polat

2019

Gazi Üniversitesi Fen Bilimleri Dergisi Part C: Tasarım Ve Teknoloji

Figure A shows the application areas of an Automatic Speech Recognition (ASR) system. Adaptation must be carried out according to the application area. During the Speech Processing phase, which is the entry point of the ASR system, features are extracted from the audio signal. Different feature extraction techniques yield different properties; for example, the Mel Frequency Cepstral Coefficient (MFCC) is a feature extraction technique commonly used in speech recognition systems. The Decoder, another component of ASR, uses the Acoustic Model (AM) and Language Model (LM) to convert the feature vectors into phoneme sequences. In acoustic modeling, the posterior probability of each phoneme within a given time signal is calculated first. In an artificial neural network-based acoustic model, the posterior probability of phonemes is independent for each window; this independence means that the phonemes in a word are treated as independent of each other. Figure A. Basic architecture of speech recognition system.

Purpose: This study presents a literature review on speech recognition and then discusses the progress recorded in this area of research for different languages. The data sets used in speech recognition systems, feature extraction approaches, speech recognition methods, and performance evaluation criteria are examined, with a focus on the development of speech recognition and the difficulties in this field.

Theory and Methods: In this study, a systematic literature review, which is an important component of a scientific article, was carried out using a combination of different review approaches.

Results: Based on the information obtained from the research, the following were evaluated: robustness to the acoustic environment, self-learning in ASR, detection of unknown words, the success of Turkish ASR at broad and limited repertoire levels, the lack of sufficient resources, and computational architectures that can be applied to ASR. In addition, the future of Turkish ASR was discussed and recommendations were made to overcome the current difficulties for Turkish ASR.

Conclusion: The aim of this study is to examine current speech recognition methods and approaches and to present the developments in this field in detail. To that end, the approaches, the datasets, and the difficulties researchers face in this field are discussed. The effect of deep learning and classical approaches on ASR was investigated. A road map is provided so that researchers can incorporate the detailed information they need into their own work and overcome the present challenges.

Deep learning approaches for speech emotion recognition: state of the art and research challenges

Jahangir, Wah, Hanif, et al.

2021

Multimed Tools Appl
