ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054057

Disentangled Speech Embeddings Using Cross-Modal Self-Supervision

Abstract: The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to tease apart, without annotation, the representations of linguistic content and speaker identity. We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provid…
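The abstract describes a two-stream design in which low-level features are shared before the representation splits into identity and content branches. As a rough, hypothetical illustration only (the layer sizes, log-mel input, and both head designs below are assumptions, not details taken from the paper), such an audio encoder could be laid out like this in PyTorch:

```python
import torch
import torch.nn as nn

class TwoStreamSpeechEncoder(nn.Module):
    """Illustrative two-stream encoder: a shared low-level trunk feeding
    separate identity and content heads. All sizes are placeholders, not
    the configuration used in the paper."""

    def __init__(self, n_mels=40, shared_dim=256, embed_dim=128):
        super().__init__()
        # Shared low-level features over a log-mel spectrogram (B, 1, n_mels, T)
        self.shared = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, shared_dim, kernel_size=3, padding=1), nn.BatchNorm2d(shared_dim), nn.ReLU(),
        )
        # Identity head: pooled over time -> one embedding per clip
        self.identity_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(shared_dim, embed_dim))
        # Content head: keeps the time axis -> one embedding per frame
        self.content_head = nn.Conv2d(shared_dim, embed_dim, kernel_size=(n_mels // 2, 1))

    def forward(self, x):
        h = self.shared(x)                          # (B, shared_dim, n_mels/2, T/2)
        identity = self.identity_head(h)            # (B, embed_dim)
        content = self.content_head(h).squeeze(2)   # (B, embed_dim, T/2)
        return identity, content
```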

Cited by 83 publications (60 citation statements). References 22 publications (29 reference statements).
“…The first step is a shared convolutional feature-extraction stage, in which a data-driven representation is extracted for audio and video independently. The architectures for these first-stage blocks are adopted from [25]. A second-level temporal aggregation block pools the feature representations for audio and video separately over entire clips into fixed-dimensional representations.…”
Section: Related Work
confidence: 99%
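The temporal aggregation block described in this excerpt maps a variable-length sequence of frame features to one fixed-dimensional vector per clip. A minimal sketch, assuming masked mean pooling over time (the citing work may use a different pooling operator):

```python
import torch

def temporal_aggregate(frame_features, lengths):
    """Pool per-frame features (B, T, D) into one fixed-dimensional vector
    per clip by masked mean over time. A simple stand-in for the temporal
    aggregation block described above."""
    mask = (torch.arange(frame_features.size(1), device=frame_features.device)
            .unsqueeze(0) < lengths.unsqueeze(1)).float()        # (B, T)
    summed = (frame_features * mask.unsqueeze(-1)).sum(dim=1)    # (B, D)
    return summed / lengths.unsqueeze(1).clamp(min=1).float()    # (B, D)
```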
“…At a high level, the loss forces the embeddings such that, for the auxiliary task, each class is predicted with the same probability. Similar to [25], we implement the confusion loss as the cross-entropy between the predictions and a uniform distribution.…”
Section: L_primary = w_em_prim · L(ê_prim, e_target)
confidence: 99%
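A minimal PyTorch sketch of a confusion loss of this form, i.e. the cross-entropy between the auxiliary classifier's predictions and a uniform distribution (the function name and reduction are mine; the quoted papers may weight or combine it differently):

```python
import torch
import torch.nn.functional as F

def confusion_loss(logits):
    """Cross-entropy between the auxiliary classifier's predictions and a
    uniform target distribution. Minimising it pushes the classifier to
    predict every class with equal probability, so the embedding carries
    little information about the auxiliary label."""
    log_probs = F.log_softmax(logits, dim=-1)   # (B, C)
    # H(uniform, p) = -(1/C) * sum_c log p_c, averaged over the batch
    return -log_probs.mean(dim=-1).mean()
```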
“…In another work from Google [28], the representation is learnt by predicting the instantaneous frequency from the magnitude of the Fourier transform. Furthermore, Nagrani et al. (2020) [29] proposed a cross-modal self-supervised learning method that learns speech representations from the correspondence between faces and audio in video. Other efforts learn a general representation by predicting the frames surrounding a given audio frame, as in wav2vec [30], speech2vec [31], and audio word2vec [32].…”
Section: Background and Related Work, A. Audio Representation Learning
confidence: 99%
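For the instantaneous-frequency pretext task mentioned first in this excerpt, the network would see only the STFT magnitude and predict a phase-derived target. A hypothetical sketch of how such input/target pairs could be formed (the use of librosa and all parameter values are assumptions, not taken from the cited work):

```python
import numpy as np
import librosa

def instantaneous_frequency_target(wav, n_fft=512, hop=160):
    """Compute an STFT-magnitude input and an instantaneous-frequency target
    (frame-to-frame derivative of the unwrapped phase). Illustrative only."""
    stft = librosa.stft(wav, n_fft=n_fft, hop_length=hop)
    magnitude = np.abs(stft)
    phase = np.unwrap(np.angle(stft), axis=1)                    # unwrap along time
    inst_freq = np.diff(phase, axis=1, prepend=phase[:, :1])     # same shape as magnitude
    return magnitude, inst_freq
```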
“…While recent years have shown great successes in speaker recognition [11,19,42,44], these successes have relied on the collection of large, labelled datasets such as VoxCeleb [12,35,36] and others [16,30]. The VoxCeleb datasets, while valuable, have been collected entirely from interviews of celebrities in YouTube videos and are limited in terms of linguistic content (celebrities mostly speak about their professions [33]), emotion, and background noise. In contrast, movies contain speech covering emotions such as anger, sadness, assertiveness, and fright, and varied background conditions: imagine the shouting in a violent scene from an action movie, or a romantic scene of reconciliation in a romcom.…”
Section: Introduction
confidence: 99%