2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)
DOI: 10.1109/iccvw.2019.00364

Self-Supervised Learning of Class Embeddings from Video

Abstract: This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information. At train time, two frames of the same video of an object class (e.g. human upper body) are extracted and each encoded to an embedding. Conditioned on these embeddings, the decoder network is tasked to transform one frame into another. To successfully perform long range transformations (e.g. a wrist lowered in one image should be mapped to the same wrist raised in a…
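The pretext task in the abstract — encode two frames of the same video, then decode one into the other conditioned on both embeddings — can be sketched minimally as follows. This is an illustrative toy with linear maps in NumPy; the function names, shapes, and weights are assumptions for exposition, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frame, W):
    # Toy linear encoder (assumption): flatten the frame and
    # project it to a low-dimensional embedding.
    return W @ frame.ravel()

def decode(src_frame, src_emb, tgt_emb, V):
    # Toy decoder (assumption): predicts the target frame from the
    # source frame plus a correction driven by the embedding difference,
    # i.e. the decoder is conditioned on both embeddings.
    delta = V @ (tgt_emb - src_emb)
    return src_frame.ravel() + delta

# Two frames of the same video (e.g. the same upper body in two poses),
# stand-ins here as random arrays.
h, w, d = 8, 8, 16
frame_a = rng.normal(size=(h, w))
frame_b = rng.normal(size=(h, w))

W = rng.normal(scale=0.1, size=(d, h * w))   # encoder weights (toy)
V = rng.normal(scale=0.1, size=(h * w, d))   # decoder weights (toy)

emb_a, emb_b = encode(frame_a, W), encode(frame_b, W)
pred_b = decode(frame_a, emb_a, emb_b, V)

# Self-supervised reconstruction objective: no labels are needed,
# the second frame itself is the regression target.
loss = np.mean((pred_b - frame_b.ravel()) ** 2)
```

Training would minimize this reconstruction loss over many frame pairs, forcing the embedding to capture whatever varies between frames of one video — pose and shape — rather than identity or appearance.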


Cited by 51 publications (61 citation statements)
References 49 publications
“…We used the pre-trained Fabnet [16] model to obtain embeddings for each frame in the video that contained the speaker's face. The pretext task of Fabnet is specially designed to encourage the network to learn the facial attributes that encode the landmarks, pose, and emotions.…”
Section: FabNet
confidence: 99%
“…Both pre-trained models of RoBERTa [19] and Wav2Vec [17] were accessed from Fairseq code-base [34] and used to extract text and speech SSL features. To download the pre-trained Fabnet model and extract features for video modality, we referred to their publication [16]. To extract features from videos, we cropped faces from each video frame using Retina-Face [42] facial recognition model.…”
Section: A. Self-Supervised Embedding Extraction
confidence: 99%
“…Other methods use no supervision, and some no data-driven prior either. The works of [28,45,50,67] learn to match pairs of images of an object, but they do not learn geometric invariants such as keypoints. [54,55,56] do learn sparse and dense landmarks, also without any annotation.…”
Section: Related Work
confidence: 99%
“…In this way, our method outputs poses that are directly interpretable. By contrast, state-of-the-art self-supervised keypoint detectors [25,50,54,67,74] do not learn "semantic" keypoints and, in post-processing, they need at least some paired supervision to output human-interpretable keypoints. We highlight this difference in fig.…”
Section: Introduction
confidence: 99%