Continuous Speech Emotion Recognition with Convolutional Neural Networks

Vryzas, Nikolaos; Vrysis, Lazaros; Matsiola, María; Kotsakis, Rigas; Dimoulas, Charalampos; Kalliris, George

doi:10.17743/jaes.2019.0043

Cited by 51 publications

(25 citation statements)

References 0 publications


“…Such analytics, in correlation with the delivered content, provide insight for future planning. The baseline metadata scheme can also be extended to involve speech [55,56] and music [57,58] emotional cues. As it is depicted in Figure 4, the functionality that concerns different groups of interest is unified in a common framework.…”

Section: 2a Web Application For Live Radio Production and Annotation

mentioning

confidence: 99%

“…To make the most of the available data and improve generalization, some common audio data augmentation techniques [56,60] have been applied:…”

Section: Speaker Recognition With Convolutional Neural Network

mentioning

confidence: 99%

“…A convolutional neural network architecture was used for classification [48,56]. The CNN is much more lightweight than the LSTM architectures, while it also models spectro-temporal information when it is fed with spectrograms as input [48,56]. The architecture of the network along with the selected hyperparameter values are presented in Table 1 and Figure 5.…”

mentioning

confidence: 99%

“…An audio file of 20 min in length was used, containing speech from three speakers, two male and one […] Implementation of the described architecture and training was held using the Keras toolbox for the Python programming language [62]. The librosa toolkit for Python [63] was used to extract Mel-scale spectrograms with a dimension of 128 Mel-coefficients from the audio files with a sampling frequency of fs = 44,100 samples/s for windows of 1 s with 90% overlap [48,56]. The extracted spectrograms were used as input to the 2D convolutional neural network.…”

mentioning

confidence: 99%
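The windowing arithmetic in the statement above (1 s windows, 90% overlap, fs = 44,100 samples/s) can be illustrated with a minimal sketch. It is not the cited authors' code: the function name `window_bounds` and the choice to drop a trailing partial window are assumptions made for illustration, and the Mel-spectrogram extraction itself (done with librosa in the cited work) is left out.

```python
def window_bounds(num_samples, fs=44_100, window_s=1.0, overlap=0.9):
    """Return (start, end) sample indices of fixed-length overlapping windows.

    Hypothetical helper: a trailing partial window is dropped, which is an
    assumption; the quoted statement does not say how the tail was handled.
    """
    win = int(round(window_s * fs))          # 44,100 samples per 1 s window
    hop = int(round(win * (1.0 - overlap)))  # 4,410 samples between window starts
    if hop <= 0:
        raise ValueError("overlap must be below 1.0")
    return [(start, start + win)
            for start in range(0, num_samples - win + 1, hop)]


# Example: 3 s of audio at 44,100 samples/s.
bounds = window_bounds(3 * 44_100)
print(len(bounds), bounds[0], bounds[1], bounds[-1])
```

With these settings, consecutive windows start 4,410 samples (0.1 s) apart, so each second of audio contributes about ten spectrogram inputs to the network.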


Web Radio Automation for Audio Stream Management in the Era of Big Data

2020

Self Cite


Radio is evolving in a changing digital media ecosystem. Audio-on-demand has shaped the landscape of big unstructured audio data available online. In this paper, a framework for knowledge extraction is introduced, to improve discoverability and enrichment of the provided content. A web application for live radio production and streaming is developed. The application offers typical live mixing and broadcasting functionality, while performing real-time annotation as a background process by logging user operation events. For the needs of a typical radio station, a supervised speaker classification model is trained for the recognition of 24 known speakers. The model is based on a convolutional neural network (CNN) architecture. Since not all speakers are known in radio shows, a CNN-based speaker diarization method is also proposed. The trained model is used for the extraction of fixed-size identity d-vectors. Several clustering algorithms are evaluated, having the d-vectors as input. The supervised speaker recognition model for 24 speakers scores an accuracy of 88.34%, while unsupervised speaker diarization scores a maximum accuracy of 87.22%, as tested on an audio file with speech segments from three unknown speakers. The results are considered encouraging regarding the applicability of the proposed methodology.



“…In this context, new audio recognition and semantic analysis techniques are deployed for General Audio Detection and Classification (GADC) tasks, which are very useful in many multidisciplinary domains [4][5][6][7][8][9][10][11][12][13][14][15][16]. Typical examples include speech recognition and perceptual enhancement [5][6][7][8], speaker indexing and diarization [14][15][16][17][18][19], voice/music detection and discrimination [1][2][3][4][9][10][11][12][13][20][21][22], information retrieval and genre classification of music [23,24], audio-driven alignment of multiple recordings [25,26], sound emotion recognition [27][28][29] and others [10,[30][31][32]. Concerning the media production and broadcasting domain, audio and audio-driven segmentation allow for the implementation of prope...…”

Section: Introduction

mentioning

confidence: 99%

Investigation of Spoken-Language Detection and Classification in Broadcasted Audio Content

et al. 2020

Self Cite


The current paper investigates spoken-language classification in audio broadcasting content. The approach reflects a real-world scenario, encountered in modern media/monitoring organizations, where semi-automated indexing/documentation is deployed and could be facilitated by the proposed language detection preprocessing. Multilingual audio recordings of specific radio streams are formed into a small dataset, which is used for the adaptive classification experiments, without seeking, at this step, a generic language recognition model. Specifically, hierarchical discrimination schemes are followed to separate voice signals before classifying the spoken languages. Supervised and unsupervised machine learning is utilized at various windowing configurations to test the validity of our hypothesis. Besides the analysis of the achieved recognition scores (partial and overall), late integration models are proposed for semi-automatic annotation of new audio recordings. Hence, data augmentation mechanisms are offered, aiming at gradually formulating a Generic Audio Language Classification Repository. This database constitutes a program-adaptive collection that, besides the self-indexing metadata mechanisms, could facilitate generic language classification models in the future, through state-of-the-art techniques like deep learning. This approach matches the investigatory inception of the project, which seeks indicators that could be applied in a second step with a larger dataset and/or an already pre-trained model, with the purpose of delivering overall results.


Design and Implementation of English Speech Scoring Data System Based on Neural Network Algorithm

Sun

2022

Cyber Security Intelligence and Analytics

