2020
DOI: 10.48550/arxiv.2002.06353
Preprint

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Huaishao Luo, Lei Ji, Botian Shi, et al.

Abstract: We propose UniViLM, a Unified Video and Language pre-training Model for multimodal understanding and generation. Motivated by the recent success of BERT-based pre-training techniques for NLP and image-language tasks, VideoBERT and CBT were proposed to exploit the BERT model for video-language pre-training using narrated instructional videos. Unlike those works, which only pre-train for understanding tasks, we propose a unified video-language pre-training model for both understanding and generation tasks. Our mod…
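The truncated abstract describes a single model serving both understanding and generation. As a rough illustration only, and not the authors' UniVL code, the sketch below shows one way such a unified design can be wired in PyTorch: a shared cross-modal encoder over concatenated text tokens and pre-extracted video features, with a matching head for understanding and a transformer decoder for caption generation. All class names, dimensions, and heads here are assumptions made for illustration.

```python
# Hypothetical sketch of a unified video-text model; NOT the UniVL implementation.
# Assumes pre-extracted video features and toy dimensions for a quick run.
import torch
import torch.nn as nn

class UnifiedVideoTextModel(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, video_feat_dim=1024,
                 n_heads=8, n_layers=2):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(video_feat_dim, d_model)   # map video features to model dim
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.cross_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.match_head = nn.Linear(d_model, 1)                 # understanding: video-text matching
        self.lm_head = nn.Linear(d_model, vocab_size)           # generation: caption token logits

    def forward(self, text_ids, video_feats, target_ids=None):
        # Concatenate text tokens and projected video frames into one joint sequence.
        x = torch.cat([self.text_emb(text_ids), self.video_proj(video_feats)], dim=1)
        memory = self.cross_encoder(x)
        # Understanding branch: score how well the text matches the video.
        match_score = self.match_head(memory.mean(dim=1))
        if target_ids is None:
            return match_score
        # Generation branch: decode a caption conditioned on the joint representation.
        logits = self.lm_head(self.decoder(self.text_emb(target_ids), memory))
        return match_score, logits

# Toy usage: 2 clips, 16 text tokens, 8 frames of 1024-d features, 12-token captions.
model = UnifiedVideoTextModel()
text = torch.randint(0, 30522, (2, 16))
video = torch.randn(2, 8, 1024)
caption = torch.randint(0, 30522, (2, 12))
score, caption_logits = model(text, video, caption)
print(score.shape, caption_logits.shape)  # torch.Size([2, 1]) torch.Size([2, 12, 30522])
```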

Cited by 95 publications (223 citation statements). References 22 publications.
“…In recent years, significant progress has been achieved due to the introduction of large-scale language-vision datasets and the development of efficient deep neural techniques that bridge the gap between language and visual understanding. Improvements have been made in numerous vision-and-language tasks, such as visual captioning [1,2], visual question answering [3], and natural language video localization [4,5,6]. In recent years there has been an increasing interest in video question-answering [7,8] tasks, where given a video, the systems are expected to retrieve the answer to a natural language question about the content in the video.…”
Section: Introduction (mentioning)
confidence: 99%
“…An instructional or how-to video contains a human subject demonstrating and narrating how to accomplish a certain task. Early works on HowTo100M have focused on leveraging this large collection for learning models that can be transferred to other tasks, such as action recognition [4,37,38], video captioning [24,36,66], or text-video retrieval [7,37,61]. The problem of recognizing the task performed in the instructional video has been considered by Bertasius et al [8].…”
Section: Related Work (mentioning)
confidence: 99%
“…For instance, step localization [59,86,106] as well as action segmentation [27,29,62,86] in instructional videos have been widely studied in the early stage. With the growing attention paid to this research topic, various kinds of tasks related to instructional videos have been proposed, e.g., video captioning [37,57,87,106] which generates the description of a video based on the actions and events, visual grounding [35,77] which locates the target in an image according to the language description, and procedure learning [2,22,27,72,73,106] which extracts key-steps in videos.…”
Section: Related Work (mentioning)
confidence: 99%