2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01463

Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions

Cited by 30 publications (12 citation statements)
References: 43 publications
“…Models trained on such instructional video datasets often do not generalize well to other domains. Monfort et al. (2021) highlight this limitation and show that training on their larger and more diverse Spoken Moments in Time dataset leads to better generalization. The point remains that these video datasets contain descriptive speech, ensuring a strong correlation between the spoken language and its visual context, a characteristic that is not representative of the experience of learning language in the real world.…”
Section: Spoken Language Grounded In Video
Mentioning confidence: 96%
“…The recently released WebVid2M dataset [9] comprises manually annotated captions, but given the monetary incentive on stock sites, they often contain appended metatags, and most lack audio. Another valuable recent dataset is Spoken Moments in Time [56]; however, it was created with significant manual effort. The largest video-text dataset by far is HowTo100M [55], generated from ASR on instructional videos; however, this data is particularly noisy, as discussed in the introduction.…”
Section: Retrieval
Mentioning confidence: 99%
“…For visual and speech pair datasets, we turn to Spoken Moments in Time (SMiT), a video-narration dataset. SMiT comprises 500k spoken captions, each depicting one of a broad range of different events in a short video (Monfort et al., 2021).…”
Section: Large-scale Multimodal Pretraining Data
Mentioning confidence: 99%