2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01463

Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

Cited by 30 publications (12 citation statements)
References: 43 publications
“…Models trained on such instructional video datasets often do not generalize well to other domains. Monfort et al. (2021) highlight this limitation and show that training on their larger and more diverse Spoken Moments in Time dataset leads to better generalization. The point remains that these video datasets contain descriptive speech, ensuring a strong correlation between the spoken language and its visual context, a characteristic that is not representative of the experience of learning language in the real world.…”
Section: Spoken Language Grounded In Video
Mentioning confidence: 96%
“…The recently released WebVid2M dataset [9] comprises manually annotated captions, but given the monetary incentive on stock sites, they often contain appended metatags, and most lack audio. Another valuable recent dataset is Spoken Moments in Time [56]; however, it was created with significant manual effort. The largest video-text dataset by far is HowTo100M [55], generated from ASR on instructional videos; however, this data is particularly noisy, as discussed in the introduction.…”
Section: Retrieval
Mentioning confidence: 99%
“…For visual and speech pair datasets, we turn to Spoken Moments in Time (SMiT), a video-narration dataset. SMiT comprises 500k spoken captions, each depicting one of a broad range of different events in a short video (Monfort et al., 2021).…”
Section: Large-scale Multimodal Pretraining Data
Mentioning confidence: 99%