One of the challenges in Speech Emotion Recognition (SER) "in the wild" is the large mismatch between training and test data (e.g. speakers and tasks). In order to improve the generalisation capabilities of emotion models, we propose to use Multi-Task Learning (MTL) with gender and naturalness as auxiliary tasks in deep neural networks. This method was evaluated in within-corpus and various cross-corpus classification experiments that simulate conditions "in the wild". Compared to state-of-the-art methods based on Single-Task Learning (STL), our proposed MTL method improved performance significantly. In particular, models using both gender and naturalness achieved larger gains than those using either gender or naturalness alone. This benefit was also found in the high-level representations of the feature space obtained from our proposed method, where discriminative emotional clusters could be observed.
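The multi-task setup described above can be illustrated with a minimal numpy sketch: one shared representation feeds an emotion head plus two auxiliary heads (gender, naturalness), and the training loss is a weighted sum of the per-task losses. All layer sizes, class counts, and the auxiliary weight below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

# Shared hidden layer (dimensions are illustrative)
n_feats, n_hidden = 40, 64
W_shared = rng.normal(0, 0.1, (n_feats, n_hidden))

# Task-specific heads: emotion (main task) plus two auxiliary tasks
heads = {
    "emotion": rng.normal(0, 0.1, (n_hidden, 4)),
    "gender": rng.normal(0, 0.1, (n_hidden, 2)),
    "naturalness": rng.normal(0, 0.1, (n_hidden, 2)),
}

def mtl_loss(x, labels, aux_weight=0.3):
    """Main (emotion) loss plus down-weighted auxiliary losses."""
    h = np.tanh(x @ W_shared)  # shared representation for all tasks
    losses = {t: cross_entropy(softmax(h @ W), labels[t])
              for t, W in heads.items()}
    return losses["emotion"] + aux_weight * (losses["gender"]
                                             + losses["naturalness"])

x = rng.normal(size=(8, n_feats))
labels = {"emotion": rng.integers(0, 4, 8),
          "gender": rng.integers(0, 2, 8),
          "naturalness": rng.integers(0, 2, 8)}
loss = mtl_loss(x, labels)
```

Because gradients of the auxiliary losses also flow through `W_shared`, the shared features are pushed to encode gender and naturalness alongside emotion, which is the mechanism the abstract credits for better cross-corpus generalisation.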
In this paper, we propose to use deep 3-dimensional convolutional networks (3D CNNs) in order to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of a Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM), our proposed 3D CNNs simultaneously extract short-term and long-term spectral features with a moderate number of parameters. We evaluated our proposed method and other state-of-the-art methods in a speaker-independent manner using aggregated corpora that give a large and diverse set of speakers. We found that 1) shallow temporal and moderately deep spectral kernels of a homogeneous architecture are optimal for the task; and 2) our 3D CNNs are more effective for spectro-temporal feature learning compared to other methods. Finally, we visualised the feature space obtained with our proposed method using t-distributed stochastic neighbour embedding (t-SNE) and could observe distinct clusters of emotions.
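To make the "shallow temporal, moderately deep spectral" kernel finding concrete, here is a naive single-channel 3D convolution in numpy applied to a spectro-temporal input volume. The input and kernel sizes are purely illustrative assumptions: the kernel spans only 3 steps along the time axis but 9 along the frequency axis, mirroring the shape the abstract reports as optimal.

```python
import numpy as np

def conv3d_valid(x, k):
    """Naive single-channel 3D convolution with valid padding."""
    T, F, C = x.shape
    t, f, c = k.shape
    out = np.zeros((T - t + 1, F - f + 1, C - c + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for l in range(out.shape[2]):
                # dot product of the kernel with one local 3D window
                out[i, j, l] = np.sum(x[i:i+t, j:j+f, l:l+c] * k)
    return out

# Illustrative volume: 20 time steps x 40 frequency bins x 5 context frames
x = np.arange(20 * 40 * 5, dtype=float).reshape(20, 40, 5)
# Kernel shallow in time (3), deeper in frequency (9), full context depth (5)
k = np.ones((3, 9, 5)) / (3 * 9 * 5)
out = conv3d_valid(x, k)
```

A real model would of course use a deep-learning framework with learned multi-channel kernels; the sketch only shows how a single 3D kernel slides over time and frequency jointly, which is what lets the network capture spectro-temporal dynamics in one operation.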
Deep architectures using identity skip-connections have demonstrated groundbreaking performance in the field of image classification. Recently, empirical studies suggested that identity skip-connections enable ensemble-like behaviour of shallow networks, and that depth is not the sole ingredient of their success. Therefore, we examine the potential of identity skip-connections for the task of Speech Emotion Recognition (SER), where moderately deep temporal architectures are often employed. To this end, we propose a novel architecture which regulates unimpeded feature flows and captures long-term dependencies via gate-based skip-connections and a memory mechanism. Our proposed architecture is compared to other state-of-the-art methods of SER and is evaluated on large aggregated corpora recorded in different contexts. Our proposed architecture outperforms the state-of-the-art methods by 9-15% and achieves an Unweighted Accuracy of 80.5% on an imbalanced class distribution. In addition, we examine a variant adopting the simplified skip-connections of Residual Networks (ResNet) and show that gate-based skip-connections are more effective than simplified skip-connections.
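The contrast between a gate-based skip-connection and ResNet's simplified additive one can be sketched in a few lines. The highway-style formulation below, g * H(x) + (1 - g) * x, is a standard gating scheme used here as an assumed stand-in for the paper's architecture; weight shapes and the gate bias are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 16
W_h = rng.normal(0, 0.1, (d, d))  # transform weights
W_g = rng.normal(0, 0.1, (d, d))  # gate weights
b_g = np.full(d, -1.0)            # negative bias biases the gate toward identity

def gated_skip(x):
    """Gate-based skip-connection: the gate g decides, per dimension,
    how much transformed signal vs. unimpeded input to pass through."""
    h = np.tanh(x @ W_h)
    g = sigmoid(x @ W_g + b_g)
    return g * h + (1.0 - g) * x

def resnet_skip(x):
    """Simplified ResNet-style skip-connection: plain addition, no gate."""
    return np.tanh(x @ W_h) + x

x = rng.normal(size=(4, d))
y = gated_skip(x)
```

The gate is what "regulates unimpeded feature flows": where g is near zero the input passes through unchanged, whereas the ResNet variant always adds the full transformed signal.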
In collaborative play, young children can exhibit different types of engagement. Some children are engaged with other children in the play activity, while others are just looking. In this study, we investigated methods to automatically detect children's levels of engagement in play settings using non-verbal vocal features. Rather than labelling the level of engagement in an absolute manner, as has frequently been done in previous related studies, we designed an annotation scheme that takes the order of children's engagement levels into account. Taking full advantage of the ordinal annotations, we explored the use of SVM-based ordinal learning, i.e. ordinal regression and ranking, and compared these to a rule-based ranking and a classification method. We found promising performance for the ordinal methods. In particular, the ranking method demonstrated the most robust performance across the large variation among children and their interactions.
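A common way to train an SVM ranker on ordinal labels, and a plausible reading of the ranking approach above, is the pairwise transform: every pair of samples with different engagement levels becomes one difference vector labelled by which sample ranks higher, and a standard binary SVM is then trained on these pairs. This is a generic RankSVM-style sketch, not the paper's exact pipeline; the feature vectors and labels are toy values.

```python
import numpy as np

def pairwise_transform(X, y):
    """RankSVM-style pairwise transform: for each ordered pair (i, j)
    with y[i] != y[j], emit X[i] - X[j] labelled +1 if i ranks higher,
    -1 otherwise. A linear binary SVM on this data learns a ranking
    direction consistent with the ordinal labels."""
    Xp, yp = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                Xp.append(X[i] - X[j]); yp.append(1)
            elif y[i] < y[j]:
                Xp.append(X[i] - X[j]); yp.append(-1)
    return np.array(Xp), np.array(yp)

# Toy example: three samples with ordinal engagement levels 0 < 1 < 2
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0, 1, 2])
Xp, yp = pairwise_transform(X, y)
```

Because the transform only uses the *order* of labels, not their numeric spacing, it matches the abstract's motivation for treating engagement annotations as ordinal rather than absolute.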
Recent advances in neural TTS have led to models that can produce high-quality synthetic speech. However, these models typically require large amounts of training data, which can make it costly to produce a new voice with the desired quality. Although multi-speaker modeling can reduce the data requirements necessary for a new voice, this approach is usually not viable for many low-resource languages for which abundant multi-speaker data is not available. In this paper, we therefore investigated to what extent multilingual multi-speaker modeling can be an alternative to monolingual multi-speaker modeling, and explored how data from foreign languages may best be combined with low-resource language data. We found that multilingual modeling can increase the naturalness of low-resource language speech, showed that multilingual models can produce speech with a naturalness comparable to monolingual multi-speaker models, and saw that the target language naturalness was affected by the strategy used to add foreign language data.
When a mobile robot interacts with a group of people, it has to consider its position and orientation. We introduce a novel study aimed at generating hypotheses on suitable behavior for such social positioning, explicitly focusing on interaction with small groups of users and allowing for the temporal and social dynamics inherent in most interactions. In particular, the interactions we look at are approach, converse and retreat. In this study, groups of three participants and a telepresence robot (controlled remotely by a fourth participant) solved a task together while we collected quantitative and qualitative data, including tracking of positioning/orientation and ratings of the behaviors used. In the data we observed a variety of patterns that can be extrapolated to hypotheses using inductive reasoning. One such pattern/hypothesis is that a (telepresence) robot could pass through a group when retreating, without this affecting how comfortable that retreat is for the group members. Another is that a group will rate the position/orientation of a (telepresence) robot as more comfortable when it is aimed more at the center of that group.