Disentangled Speech Embeddings Using Cross-Modal Self-Supervision

Nagrani, Arsha; Chung, Joon Son; Albanie, Samuel; Zisserman, Andrew

doi:10.1109/icassp40776.2020.9054057

Cited by 83 publications

(60 citation statements)

References 22 publications

(29 reference statements)

“…The first step is a shared convolutional feature extraction stage where a data-driven representation is extracted for both audio and video independently. The architectures for these first-stage blocks are adopted from [25]. A second-level temporal aggregation block pools the feature representation for audio and video separately over entire clips to a fixed-dimensional representation.…”

Section: Related Work (mentioning)

confidence: 99%

“…On a high level, the loss forces the embeddings such that, for the auxiliary task, each class is predicted with the same probability. Similar to [25], we implement the confusion loss as the cross-entropy between the predictions and a uniform distribution.…”

Section: Lprimary =Wem Prim * L(êprim Etarget) (mentioning)

confidence: 99%
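The citation statement above describes the confusion loss as the cross-entropy between the auxiliary-task predictions and a uniform distribution. A minimal sketch of that idea follows, in plain Python. The function names `softmax` and `confusion_loss` and the example logits are illustrative assumptions, not taken from either paper's code.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confusion_loss(logits):
    """Cross-entropy between a uniform target and softmax(logits).

    With K classes the target puts mass 1/K on each class, so the loss is
    -(1/K) * sum_k log p_k. It is smallest (log K) when the predictions
    are themselves uniform, i.e. when the embedding carries no usable
    information for the auxiliary task.
    """
    probs = softmax(logits)
    k = len(probs)
    return -sum(math.log(p) for p in probs) / k


# Hypothetical auxiliary-task logits over four classes.
uniform_case = confusion_loss([0.0, 0.0, 0.0, 0.0])  # predictions already uniform
peaked_case = confusion_loss([8.0, 0.0, 0.0, 0.0])   # one class strongly favoured
```

Under these assumptions, `uniform_case` equals log 4 and `peaked_case` is larger, so minimising this loss pushes the auxiliary predictions toward equal probability for every class.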

“…The model architecture for the shared 2D Convolutional layers and the fully connected layers was adapted from [25] and modified to suit the dimensions of our inputs and outputs. We use uniform duration videos of 12 seconds each as input to our system.…”

Section: Experimental Settings (mentioning)

confidence: 99%

Disentanglement for Audio-Visual Emotion Recognition Using Multitask Setup

Peri, Parthasarathy, Bradshaw, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Deep learning models trained on audio-visual data have been successfully used to achieve state-of-the-art performance for emotion recognition. In particular, models trained with multitask learning have shown additional performance improvements. However, such multitask models entangle information between the tasks, encoding the mutual dependencies present in label distributions in the real world data used for training. This work explores the disentanglement of multimodal signal representations for the primary task of emotion recognition and a secondary person identification task. In particular, we developed a multitask framework to extract low-dimensional embeddings that aim to capture emotion specific information, while containing minimal information related to person identity. We evaluate three different techniques for disentanglement and report results of up to 13% disentanglement while maintaining emotion recognition performance.

“…In another work from Google [28], the representation is learnt by predicting the instantaneous frequency based on the magnitude of the Fourier transform. Furthermore, Arsha et al (2020) [29] proposed a cross-modal self-supervised learning method to learn speech representation from the co-relationship between the face and the audio in the video. Other efforts have been made by researchers to learn a general representation by predicting the contextual frames of any particular audio frame like wav2vec [30], speech2vec [31], and audio word2vec [32].…”

Section: Background and Related Work A Audio Representation Lear (mentioning)

confidence: 99%

High-Fidelity Audio Generation and Representation Learning With Guided Adversarial Autoencoder

2020

Generating high-fidelity conditional audio samples and learning representation from unlabelled audio data are two challenging problems in machine learning research. Recent advances in the Generative Adversarial Neural Networks (GAN) architectures show great promise in addressing these challenges. To learn powerful representation using GAN architecture, it requires superior sample generation quality, which requires an enormous amount of labelled data. In this paper, we address this issue by proposing Guided Adversarial Autoencoder (GAAE), which can generate superior conditional audio samples from unlabelled audio data using a small percentage of labelled data as guidance. Representation learned from unlabelled data without any supervision does not guarantee its usability for any downstream task. On the other hand, during the representation learning, if the model is highly biased towards the downstream task, it loses its generalisation capability. This makes the learned representation hardly useful for any other tasks that are not related to that downstream task. The proposed GAAE model also addresses these issues. Using this superior conditional generation, GAAE can learn representation specific to the downstream task. Furthermore, GAAE learns another type of representation capturing the general attributes of the data, which is independent of the downstream task at hand. Experimental results involving the S09 and the NSynth dataset attest to the superior performance of GAAE compared to the state-of-the-art alternatives.

“…While recent years have shown great successes in speaker recognition [11,19,42,44], these successes have been reliant on the collection of large, labelled datasets such as VoxCeleb [12,35,36] and others [16,30]. The VoxCeleb datasets, while valuable, have been collected entirely from interviews of celebrities in YouTube videos and are limited in terms of linguistic content (celebrities mostly speak about their professions [33]), emotion, and background noise. In contrast, movies contain speech covering emotions such as anger, sadness, assertiveness, and fright, and varied background conditions - imagine the shouting in a violent scene from an action movie, or a romantic scene of reconciliation in a romcom.…”

Section: Introduction (mentioning)

confidence: 99%

Playing a Part: Speaker Verification at the movies

Brown, Huh, Nagrani, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

The goal of this work is to investigate the performance of popular speaker recognition models on speech segments from movies, where often actors intentionally disguise their voice to play a character. We make the following three contributions: (i) We collect a novel, challenging speaker recognition dataset called VoxMovies, with speech for 856 identities from almost 4000 movie clips. VoxMovies contains utterances with varying emotion, accents and background noise, and therefore comprises an entirely different domain to the interview-style, emotionally calm utterances in current speaker recognition datasets such as VoxCeleb; (ii) We provide a number of domain adaptation evaluation sets, and benchmark the performance of state-of-the-art speaker recognition models on these evaluation pairs. We demonstrate that both speaker verification and identification performance drop steeply on this new data, showing the challenge in transferring models across domains; and finally (iii) We show that simple domain adaptation paradigms improve performance, but there is still large room for improvement.
