2018
DOI: 10.1371/journal.pone.0196391
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English

Abstract: The RAVDESS is a validated multimodal database of emotional speech and song. The database is gender balanced, consisting of 24 professional actors vocalizing lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity, with an additional neutral expression. All conditions are available in face-…
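
The factorial design described in the abstract (modality, vocal channel, emotion, intensity, statement, repetition, actor) is encoded directly in each RAVDESS filename. The sketch below decodes that convention; it is based on the dataset's public filename documentation rather than anything stated in the abstract itself, so verify the field values against the release you download. The function name is illustrative.

def parse_ravdess_filename(name: str) -> dict:
    """Decode a RAVDESS filename such as "03-01-06-01-02-01-12.wav"."""
    # Seven two-digit fields, hyphen-separated, per the dataset documentation.
    modality, channel, emotion, intensity, statement, repetition, actor = (
        name.rsplit(".", 1)[0].split("-")
    )
    return {
        "modality": {"01": "full-AV", "02": "video-only", "03": "audio-only"}[modality],
        "vocal_channel": {"01": "speech", "02": "song"}[channel],
        "emotion": {
            "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
        }[emotion],
        "intensity": {"01": "normal", "02": "strong"}[intensity],  # no strong neutral
        "statement": statement,        # 01 or 02 (two lexically-matched sentences)
        "repetition": int(repetition), # each statement recorded twice
        "actor": int(actor),           # 1-24; odd-numbered actors are male
        "actor_sex": "male" if int(actor) % 2 else "female",
    }

print(parse_ravdess_filename("03-01-06-01-02-01-12.wav"))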

Cited by 1,228 publications (615 citation statements)
References 183 publications

Citation statements, ordered by relevance:
“…(ii) These features are essentially audio-only features that have been guided by the visual modality during training, and can thus be tested even on speech datasets that do not have the visual modality. (iii) The proposed features give state-of-the-art performance on discrete emotion recognition on the CREMA-D [18] and RAVDESS [19] datasets, and competitive performance with other self-supervised features on ASR on the GRID [20] and SPC [21] datasets. This shows the potential of visual supervision for learning audio representations.…”
Section: Introduction
confidence: 92%
“…The figure shows three different scenarios for the emotion type angry, where the value of c is set to 0.1, 0.5, and 1, respectively. The audio transcript is “kids are talking by the door”, taken from RAVDESS, the Ryerson Audio-Visual Database of Emotional Speech and Song. The sentence is converted into SAMPA notation: “k I d z A: t O: k I N b aI D @ d O:”.…”
Section: Overview
confidence: 99%
“…The audio transcript is “kids are talking by the door”, taken from RAVDESS, the Ryerson Audio-Visual Database of Emotional Speech and Song [30]. The sentence is converted into SAMPA notation: “k I d z A: t O: k I N b aI D @ d O:”. Subsequently, phoneme-to-viseme mapping turns the sentence into “GK IEE T SSS AHH T OHH GK IEE GK MMM AHH IEE TH Schwa T OHH RRR” as defined in Section 3.2.…”
Section: Coarticulation
confidence: 99%
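
The excerpt above walks through a SAMPA-to-viseme conversion. A minimal sketch of that step, assuming a plain dictionary lookup: only the SAMPA symbols occurring in the quoted example are covered, the full table is defined in the citing paper's Section 3.2 and is not reproduced here, and the names are illustrative rather than the authors' API.

SAMPA_TO_VISEME = {
    "k": ["GK"], "I": ["IEE"], "d": ["T"], "z": ["SSS"],
    "A:": ["AHH"], "t": ["T"], "O:": ["OHH"], "N": ["GK"],
    "b": ["MMM"], "aI": ["AHH", "IEE"],  # the diphthong expands to two visemes
    "D": ["TH"], "@": ["Schwa"],
    "r": ["RRR"],  # the excerpt's trailing "RRR" implies a rhotic in "door"
                   # that has no counterpart in the quoted SAMPA string
}

def phonemes_to_visemes(sampa: str) -> str:
    """Map a space-separated SAMPA string to a space-separated viseme string."""
    visemes = []
    for phone in sampa.split():
        visemes.extend(SAMPA_TO_VISEME[phone])
    return " ".join(visemes)

print(phonemes_to_visemes("k I d z A: t O: k I N b aI D @ d O:"))
# -> GK IEE T SSS AHH T OHH GK IEE GK MMM AHH IEE TH Schwa T OHH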
“…Phrases were required to be between 100 ms and 6 seconds, and each improviser recorded between 50 and 200 samples for each quadrant. To validate this data we created a separate process whereby the pitch range, velocities, and contour were compared to the RAVDESS [39] data set, with files removed when the variation was over a manually set threshold. RAVDESS contains speech files tagged with emotion; Figure 2 and Figure 3 clearly demonstrate the variety of prosody details apparent in the RAVDESS dataset (created using [40], [41]) and the variation between a calm and an angry utterance of the same phrase.…”
Section: Dataset and Phrase Generation
confidence: 99%
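
The excerpt gives no code for its RAVDESS-based filtering step. A hedged sketch of one way such a check could work, assuming librosa's pyin pitch tracker: the feature definitions and Euclidean deviation below are illustrative, since the excerpt names only "pitch range, velocities and contour" and does not specify the authors' threshold or distance measure.

import numpy as np
import librosa

def pitch_features(path: str) -> np.ndarray:
    """Pitch range, mean frame-to-frame velocity, and a crude contour slope."""
    y, sr = librosa.load(path, sr=None)
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    f0 = f0[voiced & ~np.isnan(f0)]       # keep voiced frames only
    if f0.size < 2:
        raise ValueError(f"no voiced frames detected in {path}")
    pitch_range = f0.max() - f0.min()           # overall range in Hz
    velocity = np.mean(np.abs(np.diff(f0)))     # mean absolute frame-to-frame change
    contour_slope = np.polyfit(np.arange(f0.size), f0, 1)[0]  # linear contour trend
    return np.array([pitch_range, velocity, contour_slope])

def keep_sample(path: str, ravdess_reference: np.ndarray, threshold: float) -> bool:
    """True when the phrase's deviation from a RAVDESS reference vector is in bounds."""
    deviation = np.linalg.norm(pitch_features(path) - ravdess_reference)
    return deviation <= threshold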