2023
DOI: 10.1109/taffc.2021.3062406

Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition?

Cited by 32 publications (22 citation statements) · References 55 publications

“…Multimodal inference: self-supervised pre-training can outperform fully supervised training and is useful in preventing overfitting to smaller data sets. Shukla et al [308] showed the potential of visual self-supervision for learning audio representations. They proposed that joint visual and audio self-supervision leads to more informative audio representations for speech and emotion recognition.…”
Section: Metaverse Implementations
confidence: 99%
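
To make the joint objective described above concrete, here is a minimal PyTorch sketch of audio-plus-visual self-supervision for an audio encoder. This is not the paper's implementation: the GRU encoder, the layer sizes, and the use of precomputed per-frame visual features as targets are all illustrative assumptions.

# Minimal sketch (not the authors' code): an audio encoder trained with
# both an audio reconstruction head and a visual prediction head, so the
# audio representation must explain both modalities.
import torch
import torch.nn as nn

class JointSSLModel(nn.Module):
    def __init__(self, n_mels=80, vis_dim=512, latent=256):
        super().__init__()
        # Shared audio encoder: mel-spectrogram frames -> latent sequence.
        self.encoder = nn.GRU(n_mels, latent, batch_first=True)
        # Audio head: reconstruct the input spectrogram frames.
        self.audio_head = nn.Linear(latent, n_mels)
        # Visual head: predict aligned visual features (assumed targets,
        # e.g. face embeddings from a pretrained network).
        self.visual_head = nn.Linear(latent, vis_dim)

    def forward(self, mels):
        z, _ = self.encoder(mels)                # (B, T, latent)
        return self.audio_head(z), self.visual_head(z)

model = JointSSLModel()
mels = torch.randn(4, 100, 80)                   # batch of spectrograms
vis_targets = torch.randn(4, 100, 512)           # aligned visual features
audio_rec, vis_pred = model(mels)
# Joint self-supervised objective: audio + visual reconstruction losses.
loss = nn.functional.mse_loss(audio_rec, mels) \
     + nn.functional.mse_loss(vis_pred, vis_targets)
loss.backward()

After pre-training, the decoders are discarded and the encoder's latent sequence serves as the speech representation for downstream emotion recognition.
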
“…We jointly optimize a family of self-supervised tasks in an encoder-decoder setup, making this work an example of multi-task self-supervised learning. Multi-task self-supervised learning has been applied to other domains such as visual data [11,24], accelerometer recordings [35], audio [34] and multi-modal inputs [37,30]. Generally in each of these domains, tasks are defined ahead of time, as is the case for tasks such as frame reconstruction, colorization, finding the relative position of image patches, mapping videos to optical flow, and video-audio alignment.…”
Section: Related Work
confidence: 99%
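
A minimal sketch of the multi-task pattern this statement describes: one shared encoder feeding a lightweight head per pretext task, with the per-task losses summed into a single objective. The task names ("reconstruct", "order") and the dimensions are illustrative assumptions, not taken from any of the cited works.

# Minimal sketch of multi-task self-supervision: one shared encoder,
# one head per predefined pretext task, losses jointly optimized.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU())
heads = nn.ModuleDict({
    "reconstruct": nn.Linear(256, 80),   # frame reconstruction
    "order":       nn.Linear(256, 2),    # temporal-order prediction
})

x = torch.randn(32, 80)
z = encoder(x)
losses = {
    "reconstruct": nn.functional.mse_loss(heads["reconstruct"](z), x),
    "order": nn.functional.cross_entropy(
        heads["order"](z), torch.randint(0, 2, (32,))),
}
total = sum(losses.values())   # the jointly optimized family of tasks
total.backward()
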
“…Early research focused on unimodal emotion recognition such as facial expression recognition (FER), speech emotion recognition (SER) and textual emotion recognition (TER), which attempt to learn emotional features from faces, voices and words, respectively. Some studies also treat another modality as an auxiliary signal to improve emotion recognition in the primary modality during training [1] [2].…”
Section: Introduction
confidence: 99%
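
The auxiliary-modality idea mentioned above can be sketched as a training-time loss on a second branch that is discarded at inference. Everything here (the feature dimensions, the 0.3 weight, treating text as the primary modality and audio as the auxiliary one) is an illustrative assumption, not the cited papers' setup.

# Minimal sketch of auxiliary-modality training: the primary branch is
# the deployed classifier; the auxiliary branch only adds a training
# loss and is dropped at test time.
import torch
import torch.nn as nn

primary = nn.Linear(300, 4)      # e.g. text features -> 4 emotions
auxiliary = nn.Linear(40, 4)     # e.g. audio features, training only

text, audio = torch.randn(16, 300), torch.randn(16, 40)
labels = torch.randint(0, 4, (16,))

loss = nn.functional.cross_entropy(primary(text), labels) \
     + 0.3 * nn.functional.cross_entropy(auxiliary(audio), labels)
loss.backward()                  # at inference, only primary(text) is used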