2021
DOI: 10.48550/arxiv.2109.08186
Preprint

Fast-Slow Transformer for Visually Grounding Speech

Abstract: We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS. FaST-VGS is a Transformer-based model for learning the associations between raw speech waveforms and visual images. The model unifies dual-encoder and cross-attention architectures into a single model, reaping the superior retrieval speed of the former along with the accuracy of the latter. FaST-VGS achieves state-of-the-art speech-image retrieval accuracy on benchmark datasets, and its learned representations exhibit strong performance…
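The fast-slow unification the abstract describes can be read as a two-stage retrieval pipeline: a dual encoder scores every candidate image with a cheap dot product, and a cross-attention scorer re-ranks only the top candidates. The following is a minimal PyTorch sketch of that idea, not the paper's actual implementation; the layer counts, embedding dimension, mean pooling, and the top_k parameter are all illustrative assumptions.

import torch
import torch.nn as nn

class FastSlowRetriever(nn.Module):
    """Hypothetical sketch of a fast-slow retrieval pipeline: a dual
    encoder ranks all images quickly, then a cross-attention scorer
    re-ranks the top-k. Not the authors' implementation."""

    def __init__(self, dim=768, top_k=20):
        super().__init__()
        self.speech_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.image_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.cross = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.score_head = nn.Linear(dim, 1)
        self.top_k = top_k

    def forward(self, speech_feats, image_feats):
        # speech_feats: (1, T, dim) one spoken query
        # image_feats:  (N, R, dim) N candidate images, R regions each
        s = self.speech_enc(speech_feats)              # (1, T, dim)
        v = self.image_enc(image_feats)                # (N, R, dim)
        q = s.mean(dim=1)                              # (1, dim)
        g = v.mean(dim=1)                              # (N, dim)

        # Fast stage: one dot product against the whole collection.
        coarse = (q @ g.t()).squeeze(0)                # (N,)
        k = min(self.top_k, coarse.numel())
        top_idx = coarse.topk(k).indices

        # Slow stage: cross-attend the query to each surviving image.
        fine = []
        for i in top_idx:
            fused = self.cross(s, v[i:i+1])            # (1, T, dim)
            fine.append(self.score_head(fused.mean(dim=1)).squeeze())
        return top_idx, torch.stack(fine)              # re-ranked scores

Because the slow cross-attention stage only sees the top_k survivors, retrieval over a large collection costs roughly one matrix product plus k cross-attention passes, which is how the speed of a dual encoder and the accuracy of cross-attention can coexist.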

Cited by 1 publication (10 citation statements)
References 23 publications
“…These works have looked at a variety of tasks, such as speech-image retrieval (Harwath, Torralba, and Glass 2016; Chrupała 2019; Ilharco, Zhang, and Baldridge 2019; Mortazavi 2020; Sanabria, Waters, and Baldridge 2021), automatic speech recognition (Sun, Harwath, and Glass 2016; Palaskar, Sanabria, and Metze 2018; Hsu, Harwath, and Glass 2019), word detection and localization (Kamper et al. 2017; Harwath and Glass 2017; Merkx, Frank, and Ernestus 2019; Wang and Hasegawa-Johnson 2020; Olaleye and Kamper 2021), hierarchical linguistic unit analysis (Chrupała, Gelderloos, and Alishahi 2017; Harwath, Hsu, and Glass 2020), cross-modality alignment (Wang et al. 2021; Khorrami and Räsänen 2021), speech segmentation, speech generation (Hsu et al. 2021b), and learning multilingual speech representations (Harwath, Chuang, and Glass 2018; Kamper and Roth 2018; Havard, Chevrot, and Besacier 2020; Ohishi et al. 2020). In this paper, we study the recently proposed FaST-VGS (Peng and Harwath 2021) speech-image retrieval model, and propose a novel extension of the model that incorporates a wav2vec2.0-style masked language modeling objective in a multi-task learning framework.…”
Section: Related Work
confidence: 99%
“…In this section, we describe two models, namely FaST-VGS and FaST-VGS+. FaST-VGS is a Transformer-based VGS model proposed by Peng and Harwath (2021), which is trained using a contrastive speech-image retrieval loss (Ilharco, Zhang, and Baldridge 2019). FaST-VGS+ augments FaST-VGS with a wav2vec2.0-style masked language modeling loss.…”
Section: Models
confidence: 99%
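The loss combination this statement describes (contrastive retrieval plus a wav2vec2.0-style masked prediction term) can be written as a weighted sum of two objectives. Below is a hedged Python sketch: the InfoNCE form of the retrieval term, the simplified masked-frame term (which uses frames from the same utterance as distractors rather than sampled quantized negatives), and the weight alpha are assumptions for illustration, not the authors' exact formulation.

import torch
import torch.nn.functional as F

def contrastive_retrieval_loss(speech_emb, image_emb, temperature=0.07):
    # InfoNCE over a batch of paired (speech, image) embeddings.
    # speech_emb, image_emb: (B, dim); row i of each is a matched pair.
    speech_emb = F.normalize(speech_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = speech_emb @ image_emb.t() / temperature    # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric: speech-to-image and image-to-speech directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def masked_prediction_loss(context_feats, target_feats, mask, temperature=0.1):
    # Simplified wav2vec2.0-style objective: at each masked frame, the
    # contextual output should identify the true target frame among
    # all frames of the same utterance.
    # context_feats, target_feats: (T, dim); mask: (T,) bool.
    c = F.normalize(context_feats[mask], dim=-1)         # (M, dim)
    t = F.normalize(target_feats, dim=-1)                # (T, dim)
    logits = c @ t.t() / temperature                     # (M, T)
    targets = mask.nonzero(as_tuple=True)[0]             # true positions
    return F.cross_entropy(logits, targets)

def multitask_loss(speech_emb, image_emb, context_feats, target_feats,
                   mask, alpha=1.0):
    # Weighted sum of the two objectives; alpha is an assumed knob.
    return (contrastive_retrieval_loss(speech_emb, image_emb) +
            alpha * masked_prediction_loss(context_feats, target_feats, mask))

Training both terms on the same speech encoder is the multi-task aspect: the retrieval loss grounds the representations in images while the masked prediction loss preserves a purely speech-internal learning signal.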