“…For example, TVQA [15] presents a large-scale dataset and a model that leverages Faster R-CNN [21] and LSTMs [12] to process visual and language inputs, and attention mechanisms [24] have also achieved great success [30,34,35]. Recently, a new research direction in VideoQA has emerged, namely external knowledge-based VideoQA [7,8], which requires information that cannot be directly obtained from the videos or the question-answer (QA) pairs and thus cannot be learned from the dataset. A model for this task must therefore draw on knowledge from external sources.…”