2022
DOI: 10.48550/arxiv.2202.03543
Preprint

Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling

Abstract: In this paper, we describe our submissions to the ZeroSpeech 2021 Challenge and SUPERB benchmark. Our submissions are based on the recently proposed FaST-VGS model, a Transformer-based model that learns to associate raw speech waveforms with semantically related images, all without the use of any transcriptions of the speech. Additionally, we introduce a novel extension of this model, FaST-VGS+, which is learned in a multi-task fashion with a masked language modeling objective in addition to the visual grounding objective.
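The abstract describes a multi-task training setup that combines a visual grounding objective with a masked language modeling (MLM) objective. The following is a minimal PyTorch sketch of such an objective, assuming an InfoNCE-style contrastive loss for the grounding term; the function names, the temperature, the mlm_weight parameter, and the loss weighting are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def contrastive_grounding_loss(speech_emb, image_emb, temperature=0.07):
    """InfoNCE-style loss over a batch of paired speech/image embeddings.

    speech_emb, image_emb: (B, D) pooled embeddings; matched pairs share a row.
    """
    speech_emb = F.normalize(speech_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = speech_emb @ image_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))              # matched pairs on the diagonal
    # Symmetric: retrieve image given speech, and speech given image.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def masked_lm_loss(token_logits, token_targets, mask):
    """Cross-entropy on masked positions only.

    token_logits: (B, T, V), token_targets: (B, T), mask: (B, T) bool,
    True where a token was masked out and must be predicted.
    """
    token_targets = token_targets.masked_fill(~mask, -100)  # -100 = ignore index
    return F.cross_entropy(token_logits.transpose(1, 2), token_targets)

def multitask_loss(speech_emb, image_emb, token_logits, token_targets, mask,
                   mlm_weight=1.0):
    """Combined objective: visual grounding plus a weighted MLM term."""
    return (contrastive_grounding_loss(speech_emb, image_emb) +
            mlm_weight * masked_lm_loss(token_logits, token_targets, mask))

Note that no transcriptions are assumed here: in a discrete-unit setting, token_targets would be pseudo-labels (e.g., quantized speech units) rather than text tokens, consistent with the self-supervised framing of the paper.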
