2018
DOI: 10.17743/jaes.2018.0036

Speech Emotion Recognition for Performance Interaction

Cited by 54 publications (21 citation statements)
References 0 publications
“…We have used a cross-corpus, vocabulary-independent and language-independent evaluation strategy. The unknown speakers have been selected from the AESDD dataset described in [55]. The unsupervised diarization results are shown in Table 2.…”
Section: Results
confidence: 99%
“…Such analytics, in correlation with the delivered content, provide insight for future planning. The baseline metadata scheme can also be extended to involve speech [55,56] and music [57,58] emotional cues. As it is depicted in Figure 4, the functionality that concerns different groups of interest is unified in a common framework.…”
Section: 2a Web Application For Live Radio Production and Annotation
confidence: 99%
“…Speech Emotion Recognition (SER) consists of the identification of the emotional content of speech signals, the task of recognizing human emotions and affective states from speech. In the SER field, there are three important aspects being studied and discussed in the literature: the choice of suitable acoustic features [9], the design of an appropriate classifier [10] and the generation of an emotional speech database [11][12][13]. Some works propose multimodal approaches combining visual and speech data to improve and strengthen emotion recognition systems [14,15].…”
Section: Technological Challenges
confidence: 99%
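The excerpt above frames SER as a pipeline of acoustic feature extraction followed by a classifier. A minimal sketch of that pipeline, using synthetic stand-in signals rather than an emotional speech corpus, hand-picked energy/zero-crossing features, and a nearest-centroid rule (all of which are illustrative assumptions, not the cited papers' methods):

```python
import numpy as np

def features(signal, frame=512):
    """Utterance-level features: mean log frame energy and zero-crossing rate."""
    frames = signal[: len(signal) // frame * frame].reshape(-1, frame)
    log_energy = np.log(np.mean(frames ** 2, axis=1) + 1e-10).mean()
    zcr = np.mean(np.abs(np.diff(np.sign(signal)))) / 2
    return np.array([log_energy, zcr])

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000)
# Synthetic stand-ins for emotional classes: "high-arousal" = loud and noisy,
# "low-arousal" = soft pure tone (real SER would use a labeled speech corpus).
loud = 0.9 * np.sin(2 * np.pi * 220 * t) + 0.3 * rng.standard_normal(t.size)
soft = 0.1 * np.sin(2 * np.pi * 220 * t)

centroids = {"high-arousal": features(loud), "low-arousal": features(soft)}

def classify(signal):
    """Assign the class whose feature centroid is nearest."""
    f = features(signal)
    return min(centroids, key=lambda k: np.linalg.norm(f - centroids[k]))

test = 0.8 * np.sin(2 * np.pi * 220 * t) + 0.25 * rng.standard_normal(t.size)
print(classify(test))  # → high-arousal
```

Real systems replace these toy features with MFCC-style descriptors and the centroid rule with a trained classifier, but the structure — features in, class label out — matches the three aspects the excerpt lists.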
“…In this context, new audio recognition and semantic analysis techniques are deployed for General Audio Detection and Classification (GADC) tasks, which are very useful in many multidisciplinary domains [4][5][6][7][8][9][10][11][12][13][14][15][16]. Typical examples include speech recognition and perceptual enhancement [5][6][7][8], speaker indexing and diarization [14][15][16][17][18][19], voice/music detection and discrimination [1][2][3][4][9][10][11][12][13][20][21][22], information retrieval and genre classification of music [23,24], audio-driven alignment of multiple recordings [25,26], sound emotion recognition [27][28][29] and others [10,[30][31][32]. Concerning the media production and broadcasting domain, audio and audio-driven segmentation allow for the implementation of prope...…”
Section: Introduction
confidence: 99%