Noé Tits scite author profile

Understanding expressed sentiment and emotions are two crucial factors in human multimodal language. This paper describes a Transformer-based joint-encoding (TBJE) for the task of Emotion Recognition and Sentiment Analysis. In addition to using the Transformer architecture, our approach relies on a modular co-attention and a glimpse layer to jointly encode one or more modalities. The proposed solution has also been submitted to the ACL20: Second Grand-Challenge on Multimodal Language to be evaluated on the CMU-MOSEI dataset. The code to replicate the presented experiments is open-source.

Exploring Transfer Learning for Low Resource Emotional TTS

Tits

Haddad

Dutoit

2019

During the last few years, spoken language technologies have improved considerably thanks to Deep Learning. However, Deep Learning-based algorithms require amounts of data that are often difficult and costly to gather. In particular, modeling the variability in speech of different speakers, different styles or different emotions with little data remains challenging. In this paper, we investigate how to leverage fine-tuning on a pre-trained Deep Learning-based TTS model to synthesize speech with a small dataset of another speaker. We then investigate the possibility of adapting this model to emotional TTS by fine-tuning the neutral TTS model with a small emotional dataset.

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech - a Deep Learning approach

Tits

2019

In this project, we aim to build a Text-to-Speech system able to produce speech with a controllable emotional expressiveness. We propose a methodology for solving this problem in three main steps. The first is the collection of emotional speech data. We discuss the various formats of existing datasets and their usability in speech generation. The second step is the development of a system to automatically annotate data with emotion/expressiveness features. We compare several techniques using transfer learning to extract such a representation through other tasks and propose a method to visualize and interpret the correlation between vocal and emotional features. The third step is the development of a deep learning-based system taking text and emotion/expressiveness as input and producing speech as output. We study the impact of fine-tuning from a neutral TTS towards an emotional TTS in terms of intelligibility and perception of the emotion.

Visualization and Interpretation of Latent Spaces for Controlling Expressive Speech Synthesis Through Audio Analysis

Tits

Wang

Haddad

et al. 2019

The field of Text-to-Speech has experienced huge improvements in recent years, benefiting from deep learning techniques. Producing realistic speech is now possible. As a consequence, research on the control of expressiveness, which allows speech to be generated in different styles or manners, has attracted increasing attention lately. Systems able to control style have been developed and show impressive results. However, the control parameters often consist of latent variables and remain complex to interpret. In this paper, we analyze and compare different latent spaces and obtain an interpretation of their influence on expressive speech. This will make it possible to build controllable speech synthesis systems with understandable behaviour.

ASR-based Features for Emotion Recognition: A Transfer Learning Approach

Tits

Haddad

Dutoit

2018

During the last decade, the applications of signal processing have drastically improved with deep learning. However, areas of affective computing such as emotional speech synthesis or emotion recognition from spoken language remain challenging. In this paper, we investigate the use of a neural Automatic Speech Recognition (ASR) system as a feature extractor for emotion recognition. We show that these features outperform the eGeMAPS feature set in predicting the valence and arousal emotional dimensions, which means that the audio-to-text mapping learned by the ASR system contains information related to the emotional dimensions in spontaneous speech. We also examine the relationship between the first layers (closer to speech) and last layers (closer to text) of the ASR and valence/arousal.

Modulated Fusion using Transformer for Linguistic-Acoustic Emotion Recognition

Delbrouck

Tits

Dupont

2020

This paper aims to bring a new lightweight yet powerful solution for the task of Emotion Recognition and Sentiment Analysis. Our motivation is to propose two architectures based on Transformers and modulation that combine the linguistic and acoustic inputs from a wide range of datasets to challenge, and sometimes surpass, the state-of-the-art in the field. To demonstrate the efficiency of our models, we carefully evaluate their performance on the IEMOCAP, MOSI, MOSEI and MELD datasets. The experiments can be directly replicated and the code is fully open for future research.

Modulated Fusion using Transformer for Linguistic-Acoustic Emotion Recognition

Jean-Benoit

Tits

Dupont

2020

Preprint

ASR-based Features for Emotion Recognition: A Transfer Learning Approach

Tits

Haddad

Dutoit

2018

Preprint

A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis
