One of the challenges in Speech Emotion Recognition (SER) "in the wild" is the large mismatch between training and test data (e.g. speakers and tasks). In order to improve the generalisation capabilities of emotion models, we propose to use Multi-Task Learning (MTL) with gender and naturalness as auxiliary tasks in deep neural networks. This method was evaluated in within-corpus and various cross-corpus classification experiments that simulate conditions "in the wild". Compared to state-of-the-art methods based on Single-Task Learning (STL), our proposed MTL method improved performance significantly. In particular, models using both gender and naturalness achieved larger gains than those using either gender or naturalness alone. This benefit was also found in the high-level representations of the feature space obtained from our proposed method, where discriminative emotional clusters could be observed.
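The multi-task setup described above can be illustrated with a minimal numpy sketch: one shared representation feeds an emotion head plus two auxiliary heads (gender, naturalness), and the training loss is a weighted sum of the per-task losses. All layer sizes, class counts, and the auxiliary weight below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

# Shared hidden layer (dimensions are illustrative)
n_feats, n_hidden = 40, 64
W_shared = rng.normal(0, 0.1, (n_feats, n_hidden))

# Task-specific heads: emotion (main task) plus two auxiliary tasks
heads = {
    "emotion": rng.normal(0, 0.1, (n_hidden, 4)),
    "gender": rng.normal(0, 0.1, (n_hidden, 2)),
    "naturalness": rng.normal(0, 0.1, (n_hidden, 2)),
}

def mtl_loss(x, labels, aux_weight=0.3):
    """Main (emotion) loss plus down-weighted auxiliary losses."""
    h = np.tanh(x @ W_shared)  # shared representation for all tasks
    losses = {t: cross_entropy(softmax(h @ W), labels[t])
              for t, W in heads.items()}
    return losses["emotion"] + aux_weight * (losses["gender"]
                                             + losses["naturalness"])

x = rng.normal(size=(8, n_feats))
labels = {"emotion": rng.integers(0, 4, 8),
          "gender": rng.integers(0, 2, 8),
          "naturalness": rng.integers(0, 2, 8)}
loss = mtl_loss(x, labels)
```

Because gradients of the auxiliary losses also flow through `W_shared`, the shared features are pushed to encode gender and naturalness alongside emotion, which is the mechanism the abstract credits for better cross-corpus generalisation.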
In this paper, we propose to use deep 3-dimensional convolutional networks (3D CNNs) in order to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of a Convolutional Neural Network and Long Short-Term Memory (CNN-LSTM), our proposed 3D CNNs simultaneously extract short-term and long-term spectral features with a moderate number of parameters. We evaluated our proposed method and other state-of-the-art methods in a speaker-independent manner using aggregated corpora that give a large and diverse set of speakers. We found that 1) shallow temporal and moderately deep spectral kernels of a homogeneous architecture are optimal for the task; and 2) our 3D CNNs are more effective for spectro-temporal feature learning compared to other methods. Finally, we visualised the feature space obtained with our proposed method using t-distributed stochastic neighbour embedding (t-SNE) and could observe distinct clusters of emotions.
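To make the "shallow temporal, moderately deep spectral" kernel finding concrete, here is a naive single-channel 3D convolution in numpy applied to a spectro-temporal input volume. The input and kernel sizes are purely illustrative assumptions: the kernel spans only 3 steps along the time axis but 9 along the frequency axis, mirroring the shape the abstract reports as optimal.

```python
import numpy as np

def conv3d_valid(x, k):
    """Naive single-channel 3D convolution with valid padding."""
    T, F, C = x.shape
    t, f, c = k.shape
    out = np.zeros((T - t + 1, F - f + 1, C - c + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for l in range(out.shape[2]):
                # dot product of the kernel with one local 3D window
                out[i, j, l] = np.sum(x[i:i+t, j:j+f, l:l+c] * k)
    return out

# Illustrative volume: 20 time steps x 40 frequency bins x 5 context frames
x = np.arange(20 * 40 * 5, dtype=float).reshape(20, 40, 5)
# Kernel shallow in time (3), deeper in frequency (9), full context depth (5)
k = np.ones((3, 9, 5)) / (3 * 9 * 5)
out = conv3d_valid(x, k)
```

A real model would of course use a deep-learning framework with learned multi-channel kernels; the sketch only shows how a single 3D kernel slides over time and frequency jointly, which is what lets the network capture spectro-temporal dynamics in one operation.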
Deep architectures using identity skip-connections have demonstrated groundbreaking performance in the field of image classification. Recently, empirical studies suggested that identity skip-connections enable ensemble-like behaviour of shallow networks, and that depth is not the sole ingredient of their success. Therefore, we examine the potential of identity skip-connections for the task of Speech Emotion Recognition (SER), where moderately deep temporal architectures are often employed. To this end, we propose a novel architecture which regulates unimpeded feature flows and captures long-term dependencies via gate-based skip-connections and a memory mechanism. Our proposed architecture is compared to other state-of-the-art methods of SER and is evaluated on large aggregated corpora recorded in different contexts. Our proposed architecture outperforms the state-of-the-art methods by 9-15% and achieves an Unweighted Accuracy of 80.5% on an imbalanced class distribution. In addition, we examine a variant adopting the simplified skip-connections of Residual Networks (ResNet) and show that gate-based skip-connections are more effective than simplified skip-connections.
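The contrast between a gate-based skip-connection and ResNet's simplified additive one can be sketched in a few lines. The highway-style formulation below, g * H(x) + (1 - g) * x, is a standard gating scheme used here as an assumed stand-in for the paper's architecture; weight shapes and the gate bias are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 16
W_h = rng.normal(0, 0.1, (d, d))  # transform weights
W_g = rng.normal(0, 0.1, (d, d))  # gate weights
b_g = np.full(d, -1.0)            # negative bias biases the gate toward identity

def gated_skip(x):
    """Gate-based skip-connection: the gate g decides, per dimension,
    how much transformed signal vs. unimpeded input to pass through."""
    h = np.tanh(x @ W_h)
    g = sigmoid(x @ W_g + b_g)
    return g * h + (1.0 - g) * x

def resnet_skip(x):
    """Simplified ResNet-style skip-connection: plain addition, no gate."""
    return np.tanh(x @ W_h) + x

x = rng.normal(size=(4, d))
y = gated_skip(x)
```

The gate is what "regulates unimpeded feature flows": where g is near zero the input passes through unchanged, whereas the ResNet variant always adds the full transformed signal.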
In collaborative play, young children can exhibit different types of engagement. Some children are engaged with other children in the play activity, while others are just looking. In this study, we investigated methods to automatically detect children's levels of engagement in play settings using non-verbal vocal features. Rather than labelling the level of engagement in an absolute manner, as has frequently been done in previous related studies, we designed an annotation scheme that takes the order of children's engagement levels into account. Taking full advantage of the ordinal annotations, we explored the use of SVM-based ordinal learning, i.e. ordinal regression and ranking, and compared these to a rule-based ranking and a classification method. We found promising performance for the ordinal methods. In particular, the ranking method demonstrated the most robust performance across the large variation among children and their interactions.
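A common way to train an SVM ranker on ordinal labels, and a plausible reading of the ranking approach above, is the pairwise transform: every pair of samples with different engagement levels becomes one difference vector labelled by which sample ranks higher, and a standard binary SVM is then trained on these pairs. This is a generic RankSVM-style sketch, not the paper's exact pipeline; the feature vectors and labels are toy values.

```python
import numpy as np

def pairwise_transform(X, y):
    """RankSVM-style pairwise transform: for each ordered pair (i, j)
    with y[i] != y[j], emit X[i] - X[j] labelled +1 if i ranks higher,
    -1 otherwise. A linear binary SVM on this data learns a ranking
    direction consistent with the ordinal labels."""
    Xp, yp = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                Xp.append(X[i] - X[j]); yp.append(1)
            elif y[i] < y[j]:
                Xp.append(X[i] - X[j]); yp.append(-1)
    return np.array(Xp), np.array(yp)

# Toy example: three samples with ordinal engagement levels 0 < 1 < 2
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0, 1, 2])
Xp, yp = pairwise_transform(X, y)
```

Because the transform only uses the *order* of labels, not their numeric spacing, it matches the abstract's motivation for treating engagement annotations as ordinal rather than absolute.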
Recent advances in neural TTS have led to models that can produce high-quality synthetic speech. However, these models typically require large amounts of training data, which can make it costly to produce a new voice with the desired quality. Although multi-speaker modeling can reduce the data requirements necessary for a new voice, this approach is usually not viable for many low-resource languages for which abundant multi-speaker data is not available. In this paper, we therefore investigated to what extent multilingual multi-speaker modeling can be an alternative to monolingual multi-speaker modeling, and explored how data from foreign languages may best be combined with low-resource language data. We found that multilingual modeling can increase the naturalness of low-resource language speech, showed that multilingual models can produce speech with a naturalness comparable to monolingual multi-speaker models, and saw that the target language naturalness was affected by the strategy used to add foreign language data.
When a mobile robot interacts with a group of people, it has to consider its position and orientation. We introduce a novel study aimed at generating hypotheses on suitable behavior for such social positioning, explicitly focusing on interaction with small groups of users and allowing for the temporal and social dynamics inherent in most interactions. In particular, the interactions we look at are approach, converse and retreat. In this study, groups of three participants and a telepresence robot (controlled remotely by a fourth participant) solved a task together while we collected quantitative and qualitative data, including tracking of positioning/orientation and ratings of the behaviors used. In the data we observed a variety of patterns that can be extrapolated to hypotheses using inductive reasoning. One such pattern/hypothesis is that a (telepresence) robot could pass through a group when retreating, without this affecting how comfortable that retreat is for the group members. Another is that a group will rate the position/orientation of a (telepresence) robot as more comfortable when it is aimed more at the center of that group.