2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01589
MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound

Cited by 102 publications (88 citation statements)
References 65 publications
“…In contrast to the rapid progress on developing large-scale image-text pre-training datasets, video-text pre-training datasets are harder to collect and often noisier. Most of the video datasets (Miech et al., 2019; Zellers et al., 2021, 2022) stem from YouTube (Figure 5.5a). YouTube videos are usually long, with a duration of 6 minutes on average.…”
Section: Pre-training Datasets
Confidence: 99%
“…The total of 6M videos is cut into 180M short clips based on punctuation predicted and added to the ASR transcripts, which may suggest sentence endings. This dataset is further augmented with the audio modality and scaled up to 1B frame-text-audio triplets, namely YTTemporal-1B in Zellers et al. (2022). • WebVid2.5M (Bain et al., 2021) is inspired by the web-crawled image-text dataset Conceptual Captions (CC3M) (Sharma et al., 2018).…”
Section: Pre-training Datasets
Confidence: 99%