2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01589
MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound

Cited by 102 publications (88 citation statements)
References 65 publications
“…In contrast to the rapid progress on developing large-scale image-text pre-training datasets, video-text pre-training datasets are harder to collect and often noisier. Most of the video datasets (Miech et al., 2019; Zellers et al., 2021, 2022) stem from YouTube (Figure 5.5a). YouTube videos are usually long, with a duration of 6 minutes on average.…”
Section: Pre-training Datasets
Confidence: 99%
“…The total of 6M videos is cut into 180M short clips based on punctuation predicted and added to the ASR transcripts, which may suggest sentence endings. This dataset is further augmented with the audio modality and scaled up to 1B frame-text-audio triplets, namely YTTemporal-1B in Zellers et al. (2022). • WebVid2.5M (Bain et al., 2021) is inspired by the web-crawled image-text dataset Conceptual Captions (CC3M) (Sharma et al., 2018).…”
Section: Pre-training Datasets
Confidence: 99%