2020
DOI: 10.48550/arxiv.2002.06353
Preprint

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Huaishao Luo, Lei Ji, Botian Shi, et al.

Abstract: We propose UniViLM, a Unified Video and Language pre-training Model for multimodal understanding and generation. Motivated by the recent success of BERT-based pre-training techniques for NLP and image-language tasks, VideoBERT and CBT were proposed to exploit the BERT model for video-language pre-training using narrated instructional videos. Unlike those works, which only pre-train for understanding tasks, we propose a unified video-language pre-training model for both understanding and generation tasks. Our mod…
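The truncated abstract describes a single model serving both understanding and generation. As a rough illustration only, and not the authors' UniVL code, the sketch below shows one way such a unified design can be wired in PyTorch: a shared cross-modal encoder over concatenated text tokens and pre-extracted video features, with a matching head for understanding and a transformer decoder for caption generation. All class names, dimensions, and heads here are assumptions made for illustration.

```python
# Hypothetical sketch of a unified video-text model; NOT the UniVL implementation.
# Assumes pre-extracted video features and toy dimensions for a quick run.
import torch
import torch.nn as nn

class UnifiedVideoTextModel(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, video_feat_dim=1024,
                 n_heads=8, n_layers=2):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(video_feat_dim, d_model)   # map video features to model dim
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.cross_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.match_head = nn.Linear(d_model, 1)                 # understanding: video-text matching
        self.lm_head = nn.Linear(d_model, vocab_size)           # generation: caption token logits

    def forward(self, text_ids, video_feats, target_ids=None):
        # Concatenate text tokens and projected video frames into one joint sequence.
        x = torch.cat([self.text_emb(text_ids), self.video_proj(video_feats)], dim=1)
        memory = self.cross_encoder(x)
        # Understanding branch: score how well the text matches the video.
        match_score = self.match_head(memory.mean(dim=1))
        if target_ids is None:
            return match_score
        # Generation branch: decode a caption conditioned on the joint representation.
        logits = self.lm_head(self.decoder(self.text_emb(target_ids), memory))
        return match_score, logits

# Toy usage: 2 clips, 16 text tokens, 8 frames of 1024-d features, 12-token captions.
model = UnifiedVideoTextModel()
text = torch.randint(0, 30522, (2, 16))
video = torch.randn(2, 8, 1024)
caption = torch.randint(0, 30522, (2, 12))
score, caption_logits = model(text, video, caption)
print(score.shape, caption_logits.shape)  # torch.Size([2, 1]) torch.Size([2, 12, 30522])
```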

Cited by 95 publications (223 citation statements). References 22 publications.
“…In recent years, significant progress has been achieved due to the introduction of large-scale language-vision datasets and the development of efficient deep neural techniques that bridge the gap between language and visual understanding. Improvements have been made in numerous vision-and-language tasks, such as visual captioning [1,2], visual question answering [3], and natural language video localization [4,5,6]. In recent years there has been an increasing interest in video question-answering [7,8] tasks, where given a video, the systems are expected to retrieve the answer to a natural language question about the content in the video.…”
Section: Introduction (mentioning)
confidence: 99%
“…An instructional or how-to video contains a human subject demonstrating and narrating how to accomplish a certain task. Early works on HowTo100M have focused on leveraging this large collection for learning models that can be transferred to other tasks, such as action recognition [4,37,38], video captioning [24,36,66], or text-video retrieval [7,37,61]. The problem of recognizing the task performed in the instructional video has been considered by Bertasius et al [8].…”
Section: Related Work (mentioning)
confidence: 99%
“…For instance, step localization [59,86,106] as well as action segmentation [27,29,62,86] in instructional videos have been widely studied in the early stage. With the growing attention paid to this research topic, various kinds of tasks related to instructional videos have been proposed, e.g., video captioning [37,57,87,106] which generates the description of a video based on the actions and events, visual grounding [35,77] which locates the target in an image according to the language description, and procedure learning [2,22,27,72,73,106] which extracts key-steps in videos.…”
Section: Related Work (mentioning)
confidence: 99%