2020
DOI: 10.1609/aaai.v34i07.6713

KnowIT VQA: Answering Knowledge-Based Questions about Videos

Abstract: We propose a novel video understanding task by fusing knowledge-based and video question answering. First, we introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which require the experience obtained from watching the series to be answered. Second, we propose a video understanding model by combining the visual and textual video content …

Cited by 60 publications (69 citation statements). References 25 publications.
“…As a typical multimodal task, VideoQA requires thorough visual and textual understanding. In recent years, some more restricted sub-tasks have also been proposed to enhance the interpretability, such as Knowledge-based VideoQA [9] and Spatio-temporal grounding VideoQA [20]. Nevertheless, the VideoQA framework generally consists of a video encoder, a question encoder, an embedding alignment module, and a predictor.…”
Section: Related Work, 2.1 Video Question Answering (mentioning)
confidence: 99%
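The excerpt above describes a generic four-component VideoQA framework: a video encoder, a question encoder, an embedding alignment module, and a predictor. The following is a toy sketch of that structure, not the paper's model; all function names and the averaging/dot-product logic are illustrative assumptions.

```python
# Toy sketch of the generic four-stage VideoQA pipeline: video encoder,
# question encoder, embedding alignment, answer predictor. Real systems
# use learned neural encoders; here each stage is a simple stand-in.

def encode_video(frames):
    """Toy video encoder: average per-frame feature vectors."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def encode_question(tokens, vocab_vectors):
    """Toy question encoder: average token embeddings."""
    vecs = [vocab_vectors[t] for t in tokens]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def align(video_emb, question_emb):
    """Toy alignment module: element-wise fusion of the two embeddings."""
    return [v * q for v, q in zip(video_emb, question_emb)]

def predict(fused_emb, answer_embs):
    """Toy predictor: return the index of the candidate answer whose
    embedding has the highest dot product with the fused embedding."""
    def score(a):
        return sum(f * x for f, x in zip(fused_emb, a))
    return max(range(len(answer_embs)), key=lambda i: score(answer_embs[i]))

# Usage with tiny 2-d feature vectors:
v = encode_video([[1.0, 0.0], [0.0, 1.0]])          # [0.5, 0.5]
q = encode_question(["who", "is"],
                    {"who": [1.0, 0.0], "is": [0.0, 1.0]})
fused = align(v, q)                                  # [0.25, 0.25]
best = predict(fused, [[1.0, -1.0], [1.0, 1.0]])     # index 1
```

In a trained model, each stage would be a neural network and the predictor would output a distribution over candidate answers, but the data flow between the four components is the same.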
“…In addition, to better understand the video content, which is usually a kind of multimodal data including visual and linguistic information, the extra linguistic information, such as subtitles [7], [8], captions [22], [23], and knowledge [9], [24], are introduced to VideoQA tasks. Our work aims to handle such generalized VideoQA tasks that consider both visual and linguistic information, which is more practical than those visual-specific VideoQA tasks.…”
Section: A. Video Question Answering (mentioning)
confidence: 99%
“…Many tasks have been proposed to evaluate such ability, and visual question answering is one of those tasks (Antol et al., 2015; Lu et al., 2016; Fukui et al., 2016; Xu and Saenko, 2016; Goyal et al., 2017; Anderson et al., 2018). Recently, beyond question answering on a single image, attention to understanding and extracting information from a sequence of images, i.e., a video, is rising (Tapaswi et al., 2016; Maharaj et al., 2017; Jang et al., 2017; Zadeh et al., 2019; Lei et al., 2020; Garcia et al., 2020). Answering questions on videos requires an …”
Section: Related Work (mentioning)
confidence: 99%