2023
DOI: 10.1109/taffc.2021.3062406

Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition?

Cited by 32 publications (22 citation statements) · References 55 publications

“…Multimodal inference: self-supervised pre-training can outperform fully supervised training and is useful in preventing overfitting to smaller data sets. Shukla et al [308] showed the potential of visual self-supervision for learning audio representations. They proposed that joint visual and audio self-supervision leads to more informative audio representations for speech and emotion recognition.…”
Section: Metaverse Implementations
confidence: 99%
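
To make the joint objective described above concrete, here is a minimal PyTorch sketch of audio-plus-visual self-supervision for an audio encoder. This is not the paper's implementation: the GRU encoder, the layer sizes, and the use of precomputed per-frame visual features as targets are all illustrative assumptions.

# Minimal sketch (not the authors' code): an audio encoder trained with
# both an audio reconstruction head and a visual prediction head, so the
# audio representation must explain both modalities.
import torch
import torch.nn as nn

class JointSSLModel(nn.Module):
    def __init__(self, n_mels=80, vis_dim=512, latent=256):
        super().__init__()
        # Shared audio encoder: mel-spectrogram frames -> latent sequence.
        self.encoder = nn.GRU(n_mels, latent, batch_first=True)
        # Audio head: reconstruct the input spectrogram frames.
        self.audio_head = nn.Linear(latent, n_mels)
        # Visual head: predict aligned visual features (assumed targets,
        # e.g. face embeddings from a pretrained network).
        self.visual_head = nn.Linear(latent, vis_dim)

    def forward(self, mels):
        z, _ = self.encoder(mels)                # (B, T, latent)
        return self.audio_head(z), self.visual_head(z)

model = JointSSLModel()
mels = torch.randn(4, 100, 80)                   # batch of spectrograms
vis_targets = torch.randn(4, 100, 512)           # aligned visual features
audio_rec, vis_pred = model(mels)
# Joint self-supervised objective: audio + visual reconstruction losses.
loss = nn.functional.mse_loss(audio_rec, mels) \
     + nn.functional.mse_loss(vis_pred, vis_targets)
loss.backward()

After pre-training, the decoders are discarded and the encoder's latent sequence serves as the speech representation for downstream emotion recognition.
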
“…We jointly optimize a family of self-supervised tasks in an encoder-decoder setup, making this work an example of multi-task self-supervised learning. Multi-task self-supervised learning has been applied to other domains such as visual data [11,24], accelerometer recordings [35], audio [34] and multi-modal inputs [37,30]. Generally in each of these domains, tasks are defined ahead of time, as is the case for tasks such as frame reconstruction, colorization, finding the relative position of image patches, mapping videos to optical flow, and video-audio alignment.…”
Section: Related Work
confidence: 99%
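
A minimal sketch of the multi-task pattern this statement describes: one shared encoder feeding a lightweight head per pretext task, with the per-task losses summed into a single objective. The task names ("reconstruct", "order") and the dimensions are illustrative assumptions, not taken from any of the cited works.

# Minimal sketch of multi-task self-supervision: one shared encoder,
# one head per predefined pretext task, losses jointly optimized.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU())
heads = nn.ModuleDict({
    "reconstruct": nn.Linear(256, 80),   # frame reconstruction
    "order":       nn.Linear(256, 2),    # temporal-order prediction
})

x = torch.randn(32, 80)
z = encoder(x)
losses = {
    "reconstruct": nn.functional.mse_loss(heads["reconstruct"](z), x),
    "order": nn.functional.cross_entropy(
        heads["order"](z), torch.randint(0, 2, (32,))),
}
total = sum(losses.values())   # the jointly optimized family of tasks
total.backward()
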
“…Early research focused on unimodal emotion recognition such as facial expression recognition (FER), speech emotion recognition (SER) and textual emotion recognition (TER), which attempt to learn emotional features from faces, voices and words, respectively. Some studies also treat another modality as an auxiliary signal to improve emotion recognition in the primary modality during training [1] [2].…”
Section: Introduction
confidence: 99%
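
The auxiliary-modality idea mentioned above can be sketched as a training-time loss on a second branch that is discarded at inference. Everything here (the feature dimensions, the 0.3 weight, treating text as the primary modality and audio as the auxiliary one) is an illustrative assumption, not the cited papers' setup.

# Minimal sketch of auxiliary-modality training: the primary branch is
# the deployed classifier; the auxiliary branch only adds a training
# loss and is dropped at test time.
import torch
import torch.nn as nn

primary = nn.Linear(300, 4)      # e.g. text features -> 4 emotions
auxiliary = nn.Linear(40, 4)     # e.g. audio features, training only

text, audio = torch.randn(16, 300), torch.randn(16, 40)
labels = torch.randint(0, 4, (16,))

loss = nn.functional.cross_entropy(primary(text), labels) \
     + 0.3 * nn.functional.cross_entropy(auxiliary(audio), labels)
loss.backward()                  # at inference, only primary(text) is used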