2020
DOI: 10.1609/aaai.v34i07.6713

KnowIT VQA: Answering Knowledge-Based Questions about Videos

Abstract: We propose a novel video understanding task by fusing knowledge-based and video question answering. First, we introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom. The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions, which require the experience obtained from watching the series to be answered. Second, we propose a video understanding model by combining the visual and textual video content …

Cited by 60 publications (69 citation statements). References 25 publications.
“…As a typical multimodal task, VideoQA requires thorough visual and textual understanding. In recent years, some more restricted sub-tasks have also been proposed to enhance the interpretability, such as Knowledge-based VideoQA [9] and Spatio-temporal grounding VideoQA [20]. Nevertheless, the VideoQA framework generally consists of a video encoder, a question encoder, an embedding alignment module, and a predictor.…”
Section: Related Work, 2.1 Video Question Answering (mentioning)
confidence: 99%
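The excerpt above describes a generic four-component VideoQA framework: a video encoder, a question encoder, an embedding alignment module, and a predictor. The following is a toy sketch of that structure, not the paper's model; all function names and the averaging/dot-product logic are illustrative assumptions.

```python
# Toy sketch of the generic four-stage VideoQA pipeline: video encoder,
# question encoder, embedding alignment, answer predictor. Real systems
# use learned neural encoders; here each stage is a simple stand-in.

def encode_video(frames):
    """Toy video encoder: average per-frame feature vectors."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def encode_question(tokens, vocab_vectors):
    """Toy question encoder: average token embeddings."""
    vecs = [vocab_vectors[t] for t in tokens]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def align(video_emb, question_emb):
    """Toy alignment module: element-wise fusion of the two embeddings."""
    return [v * q for v, q in zip(video_emb, question_emb)]

def predict(fused_emb, answer_embs):
    """Toy predictor: return the index of the candidate answer whose
    embedding has the highest dot product with the fused embedding."""
    def score(a):
        return sum(f * x for f, x in zip(fused_emb, a))
    return max(range(len(answer_embs)), key=lambda i: score(answer_embs[i]))

# Usage with tiny 2-d feature vectors:
v = encode_video([[1.0, 0.0], [0.0, 1.0]])          # [0.5, 0.5]
q = encode_question(["who", "is"],
                    {"who": [1.0, 0.0], "is": [0.0, 1.0]})
fused = align(v, q)                                  # [0.25, 0.25]
best = predict(fused, [[1.0, -1.0], [1.0, 1.0]])     # index 1
```

In a trained model, each stage would be a neural network and the predictor would output a distribution over candidate answers, but the data flow between the four components is the same.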
“…In addition, to better understand the video content, which is usually a kind of multimodal data including visual and linguistic information, the extra linguistic information, such as subtitles [7], [8], captions [22], [23], and knowledge [9], [24], are introduced to VideoQA tasks. Our work aims to handle such generalized VideoQA tasks that consider both visual and linguistic information, which is more practical than those visual-specific VideoQA tasks.…”
Section: A. Video Question Answering (mentioning)
confidence: 99%
“…Many tasks have been proposed to evaluate such ability, and visual question answering is one of those tasks (Antol et al., 2015; Lu et al., 2016; Fukui et al., 2016; Xu and Saenko, 2016; Goyal et al., 2017; Anderson et al., 2018). Recently, beyond question answering on a single image, attention to understanding and extracting information from a sequence of images, i.e., a video, is rising (Tapaswi et al., 2016; Maharaj et al., 2017; Jang et al., 2017; Zadeh et al., 2019; Lei et al., 2020; Garcia et al., 2020). Answering questions on videos requires an …”
Section: Related Work (mentioning)
confidence: 99%