2020
DOI: 10.1007/978-3-030-58523-5_34

Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions

Cited by 21 publications (28 citation statements)
References 53 publications
“…For VideoQA, KnowIT VQA [8] is the first unstructured video-based dataset built by humans. ROLL [7] leverages online knowledge to answer questions about video stories, showing the great potential of knowledge-based models in VideoQA.…”
Section: Related Work
confidence: 99%
“…For example, TVQA [15] presents a large-scale dataset and a model that leverages Faster-RCNN [21] and LSTMs [12] to process visual and language inputs, and the use of attention mechanisms [24] has also achieved great success [30,34,35]. Recently, a new research direction in VideoQA has emerged, i.e., external knowledge-based VideoQA [7,8], which requires information that cannot be directly obtained from the videos or the question-answer (QA) pairs and thus cannot be learned from the dataset. In this task, therefore, a model needs to refer to knowledge from external sources.…”
Section: Introduction
confidence: 99%
“…A. ROLL [16]: The aim of VQA applications is not limited to images but extends to videos. Inspired by how humans constantly reason over dialogue and actions throughout a movie's storyline, the ROLL model leverages the tasks of dialog comprehension, scene reasoning, and storyline recall, with access to external resources to retrieve contextual information.…”
Section: Application
confidence: 99%
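The citing description above characterizes ROLL as combining several evidence streams (dialog, scene reasoning, storyline recall with external knowledge). The sketch below is only an illustration of that general multi-stream answer-scoring idea under assumed names and shapes (PyTorch, a hypothetical MultiStreamAnswerFusion module, per-stream feature tensors); it is not the authors' implementation of ROLL.

```python
# Minimal sketch (assumptions throughout): score candidate answers from three
# separate evidence streams and fuse the per-stream scores with learned weights.
import torch
import torch.nn as nn


class MultiStreamAnswerFusion(nn.Module):
    """Toy three-branch answer scorer with learned soft stream weighting."""

    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        # One scoring head per evidence stream (dialog / scene / storyline).
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden_dim, 1)
            for name in ("dialog", "scene", "storyline")
        })
        # Learned weights controlling each stream's contribution to the answer.
        self.stream_weights = nn.Parameter(torch.ones(3))

    def forward(self, features: dict) -> torch.Tensor:
        # features[name]: (batch, num_answers, hidden_dim), a joint encoding of
        # question + candidate answer + that stream's evidence.
        per_stream = torch.stack(
            [self.heads[name](features[name]).squeeze(-1)  # (batch, num_answers)
             for name in self.heads],
            dim=0,
        )  # (3, batch, num_answers)
        weights = torch.softmax(self.stream_weights, dim=0).view(3, 1, 1)
        return (weights * per_stream).sum(dim=0)  # fused answer scores


if __name__ == "__main__":
    model = MultiStreamAnswerFusion()
    batch, num_answers, hidden = 2, 4, 128
    feats = {name: torch.randn(batch, num_answers, hidden)
             for name in ("dialog", "scene", "storyline")}
    scores = model(feats)          # (2, 4)
    print(scores.argmax(dim=-1))   # predicted answer index per example
```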
“…In addition, to enhance the efficiency of visual question answering, multimodal information fusion mechanisms such as BLOCK [13], grid-feature [14], and DACT [15] were proposed. Moreover, building on video question answering datasets and scientific-diagram-based datasets, video question answering models such as ROLL [16] and hstar [17], as well as scientific diagram analysis models, have been established. This paper reviews the existing datasets, metrics, and models of VQA and analyzes their progress and remaining problems.…”
Section: Introduction
confidence: 99%