Interspeech 2020
DOI: 10.21437/interspeech.2020-3024

Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks

Abstract: Semantically-aligned (speech, image) datasets can be used to explore "visually-grounded speech". In a majority of existing investigations, features of an image signal are extracted using neural networks "pre-trained" on other tasks (e.g., classification on ImageNet). In still others, pre-trained networks are used to extract audio features prior to semantic embedding. Without "transfer learning" through pre-trained initialization or pre-trained feature extraction, previous results have tended to show low rates o…
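To make the setting concrete, below is a minimal, hypothetical sketch of how a speech-image joint embedding can be trained entirely from randomly initialized encoders, i.e., without ImageNet-pre-trained image features or pre-trained audio feature extractors. This is not the paper's architecture; the layer sizes, encoder designs, and the InfoNCE-style objective are illustrative assumptions.

```python
# Sketch only: from-scratch speech-image joint embedding (no pre-trained encoders).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    """1-D CNN over log-mel frames -> fixed-size, L2-normalized embedding."""
    def __init__(self, n_mels=40, dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=11, stride=2, padding=5), nn.ReLU(),
            nn.Conv1d(256, 512, kernel_size=11, stride=2, padding=5), nn.ReLU(),
            nn.Conv1d(512, dim, kernel_size=11, stride=2, padding=5), nn.ReLU(),
        )
    def forward(self, mels):                  # mels: (B, n_mels, T)
        h = self.conv(mels)                   # (B, dim, T')
        return F.normalize(h.mean(dim=2), dim=-1)

class ImageEncoder(nn.Module):
    """Small CNN over raw RGB images, trained from scratch (no ImageNet init)."""
    def __init__(self, dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
    def forward(self, imgs):                  # imgs: (B, 3, H, W)
        return F.normalize(self.conv(imgs).flatten(1), dim=-1)

def contrastive_loss(s, v, temperature=0.07):
    """InfoNCE-style loss over the matched (speech, image) pairs in a batch."""
    logits = s @ v.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In this setup the only supervision is the pairing of each spoken caption with its image; no prior classification task supplies features or initialization.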

Cited by 6 publications (4 citation statements)
References 26 publications (40 reference statements)
“…Starting from the work of Synnaeve, Versteegh, and Dupoux (2014); Harwath and Glass (2015), researchers have studied the ability of models to learn to recognize the structure of spoken language, such as words and sub-word units, by training the models to associate speech waveforms with contextually relevant visual inputs. These works have looked at a variety of tasks, such as speech-image retrieval (Harwath, Torralba, and Glass 2016; Chrupała 2019; Ilharco, Zhang, and Baldridge 2019; Mortazavi 2020; Sanabria, Waters, and Baldridge 2021), automatic speech recognition (Sun, Harwath, and Glass 2016; Palaskar, Sanabria, and Metze 2018; Hsu, Harwath, and Glass 2019), word detection and localization (Kamper et al. 2017; Harwath and Glass 2017; Merkx, Frank, and Ernestus 2019; Wang and Hasegawa-Johnson 2020; Olaleye and Kamper 2021), hierarchical linguistic unit analysis (Chrupała, Gelderloos, and Alishahi 2017; Harwath, Hsu, and Glass 2020), cross-modality alignment (Wang et al. 2021; Khorrami and Räsänen 2021), speech segmentation, speech generation (Hsu et al. 2021b), and learning multilingual speech representations (Harwath, Chuang, and Glass 2018; Kamper and Roth 2018; Havard, Chevrot, and Besacier 2020; Ohishi et al. 2020). In this paper, we study the recently proposed FaST-VGS (Peng and Harwath 2021) speech-image retrieval model, and propose a novel extension of the model that incorporates a wav2vec 2.0-style masked language modeling objective in a multi-task learning framework.…”
Section: Related Work
confidence: 99%
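The multi-task idea described in the excerpt above, a retrieval loss combined with a wav2vec 2.0-style masked prediction term over speech frames, can be sketched as follows. This is a hedged illustration rather than the actual FaST-VGS code: the function names, the weighting factor lambda_mlm, and the use of unquantized frame features as prediction targets are all simplifying assumptions.

```python
# Sketch only: retrieval loss + simplified wav2vec 2.0-flavored masked contrastive term.
import torch
import torch.nn.functional as F

def masked_frame_loss(context, targets, mask, temperature=0.1):
    """Contrastive prediction of the true frame features at masked positions.

    context: (B, T, D) transformer outputs computed from a masked input
    targets: (B, T, D) frame features before masking (stand-in for quantized units)
    mask:    (B, T) boolean, True where the input frame was masked
    """
    c = F.normalize(context[mask], dim=-1)    # (N, D) predictions at masked positions
    t = F.normalize(targets[mask], dim=-1)    # (N, D) their true features
    logits = c @ t.t() / temperature          # other masked frames serve as negatives
    labels = torch.arange(c.size(0), device=c.device)
    return F.cross_entropy(logits, labels)

def multitask_loss(retrieval_loss, context, targets, mask, lambda_mlm=1.0):
    """Total training objective: retrieval term plus weighted masked-prediction term."""
    return retrieval_loss + lambda_mlm * masked_frame_loss(context, targets, mask)
```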
“…Several works [2, 3, 15] demonstrate the ability to learn semantic relationships between objects in images and the spoken words describing them using only the pairing between images and spoken captions as supervision. Using this framework, researchers have proposed improved image encoders, audio encoders, and loss functions [4-8, 16-20]. Harwath et al. [3, 4, 21] collected 400k spoken audio captions of images in the Places205 [22] dataset in English, which is one of the largest spoken caption datasets.…”
Section: Related Work
confidence: 99%
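The pairing-only supervision described in this excerpt is commonly implemented with a margin ranking (triplet) objective in which the other items in a batch act as impostors. The sketch below is an illustrative version of that idea; the function name and margin value are assumptions, not taken from the cited works.

```python
# Sketch only: margin ranking loss over matched (image, spoken caption) pairs.
import torch

def margin_ranking_loss(img_emb, aud_emb, margin=1.0):
    """img_emb, aud_emb: (B, D) L2-normalized embeddings of matched pairs."""
    sim = img_emb @ aud_emb.t()                        # (B, B) similarities
    pos = sim.diag().unsqueeze(1)                      # matched-pair scores, (B, 1)
    # Hinge on impostor captions (within a row) and impostor images (within a column).
    cost_captions = (margin + sim - pos).clamp(min=0)
    cost_images = (margin + sim - pos.t()).clamp(min=0)
    eye = torch.eye(sim.size(0), device=sim.device, dtype=torch.bool)
    cost_captions = cost_captions.masked_fill(eye, 0)  # ignore the true pair itself
    cost_images = cost_images.masked_fill(eye, 0)
    return cost_captions.mean() + cost_images.mean()
```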
“…Although retrieval accuracy was often used as an evaluation benchmark to assess how well a model can predict visual semantics directly from a raw speech signal, in many cases these papers put a greater emphasis on analyzing how linguistic structure emerged within the representations learned by the model. In general, the accuracy of speech-image retrieval systems has lagged behind their text-image counterparts, but recently several works have made enormous progress towards closing this gap, demonstrating that speech-enabled image retrieval is a compelling application in its own right [20,21,22,23].…”
Section: Introduction and Related Work
confidence: 99%
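The retrieval-accuracy benchmark referred to in the excerpt above is typically reported as Recall@K. A minimal sketch of that computation for speech-to-image retrieval, under the usual assumption that query i has exactly one correct image (the i-th), might look like this; the function name and K values are illustrative.

```python
# Sketch only: Recall@K for speech -> image retrieval over N matched pairs.
import torch

def recall_at_k(speech_emb, image_emb, ks=(1, 5, 10)):
    """speech_emb, image_emb: (N, D) L2-normalized embeddings of matched pairs."""
    sim = speech_emb @ image_emb.t()                    # (N, N) query-by-image scores
    ranks = sim.argsort(dim=1, descending=True)         # ranked image indices per query
    targets = torch.arange(sim.size(0), device=sim.device).unsqueeze(1)
    # Position of the correct image within each query's ranking.
    hit_pos = (ranks == targets).float().argmax(dim=1)
    return {k: (hit_pos < k).float().mean().item() for k in ks}
```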