2018
DOI: 10.48550/arxiv.1804.09412
Preprint

Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents

Abstract: Movies provide us with a mass of visual content as well as attractive stories. Existing methods have shown that understanding movie stories through visual content alone is still a hard problem. In this paper, for answering questions about movies, we put forward a Layered Memory Network (LMN) that represents frame-level and clip-level movie content by the Static Word Memory module and the Dynamic Subtitle Memory module, respectively. In particular, we first extract words and sentences from the training mo…
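A minimal sketch of the layered attention idea described in the abstract: frame-level visual features read from a static word memory, clip-level features read from a dynamic subtitle memory, and the fused movie representation scores each candidate answer. The class name, dimensions, fusion, and scoring below are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of the layered memory idea, not the paper's architecture.
import torch
import torch.nn as nn

class LayeredMemorySketch(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=300, hid_dim=512):
        super().__init__()
        self.frame_proj = nn.Linear(vis_dim, hid_dim)   # frame-level visual features
        self.clip_proj = nn.Linear(vis_dim, hid_dim)    # clip-level visual features
        self.word_proj = nn.Linear(txt_dim, hid_dim)    # static word memory slots
        self.subt_proj = nn.Linear(txt_dim, hid_dim)    # dynamic subtitle memory slots
        self.ans_proj = nn.Linear(txt_dim, hid_dim)     # candidate QA encodings

    def attend(self, query, memory):
        # Soft attention: each query vector reads a weighted sum of memory slots.
        weights = torch.softmax(query @ memory.transpose(-1, -2), dim=-1)
        return weights @ memory

    def forward(self, frames, clips, words, subtitles, qa):
        # frames: (T, vis_dim), clips: (C, vis_dim),
        # words: (W, txt_dim), subtitles: (S, txt_dim), qa: (A, txt_dim) candidates
        frame_read = self.attend(self.frame_proj(frames), self.word_proj(words))
        clip_read = self.attend(self.clip_proj(clips), self.subt_proj(subtitles))
        movie_repr = frame_read.mean(0) + clip_read.mean(0)  # simple fusion (assumption)
        return self.ans_proj(qa) @ movie_repr                # one score per candidate

# Usage with random tensors standing in for extracted features.
model = LayeredMemorySketch()
scores = model(torch.randn(40, 2048), torch.randn(8, 2048),
               torch.randn(100, 300), torch.randn(30, 300), torch.randn(5, 300))
print(scores.shape)  # torch.Size([5])
```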

Cited by 2 publications (18 citation statements) | References 25 publications
“…Video QA Methods. Extensive studies have been conducted for video QA [61,12,17,23,24,31,49,11,25,60,46,51,8,33,63,66]. Yu et al. [61] employed an LSTM to encode videos and QA pairs, and adopted an attention mechanism [58].…”
Section: Related Work
confidence: 99%
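A minimal sketch of the recipe this statement attributes to Yu et al. [61]: one LSTM encodes the frame sequence, another encodes the QA pair, and a QA-driven soft attention pools the video states before scoring. Module names, dimensions, and the scoring head are assumptions for illustration only.

```python
# Hypothetical LSTM + attention video QA scorer, assumed details throughout.
import torch
import torch.nn as nn

class LSTMAttentionQA(nn.Module):
    def __init__(self, vis_dim=2048, emb_dim=300, hid_dim=512):
        super().__init__()
        self.video_lstm = nn.LSTM(vis_dim, hid_dim, batch_first=True)
        self.text_lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.att = nn.Linear(hid_dim, hid_dim)
        self.score = nn.Linear(2 * hid_dim, 1)

    def forward(self, frames, qa_tokens):
        # frames: (1, T, vis_dim); qa_tokens: (1, L, emb_dim) for one candidate answer
        video_states, _ = self.video_lstm(frames)        # (1, T, hid_dim)
        _, (qa_state, _) = self.text_lstm(qa_tokens)     # final hidden state of QA pair
        qa_vec = qa_state[-1]                            # (1, hid_dim)
        weights = torch.softmax(
            (video_states @ self.att(qa_vec).unsqueeze(-1)).squeeze(-1), dim=-1)
        attended = (weights.unsqueeze(-1) * video_states).sum(dim=1)   # pooled video
        return self.score(torch.cat([attended, qa_vec], dim=-1))       # answer score

model = LSTMAttentionQA()
print(model(torch.randn(1, 40, 2048), torch.randn(1, 12, 300)).shape)  # torch.Size([1, 1])
```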
“…[Na et al., 2017] provides a read-write memory network where the read network and write network consist of multiple convolutional layers, which enables memory read and write operations to have high capacity and flexibility. [Wang et al., 2018a] focuses on the video representation, and puts forward a layered memory network that represents frame-level and clip-level movie content by a static word memory module and a dynamic subtitle memory module, respectively.…”
Section: Multimodal Question Answering
confidence: 99%
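A minimal sketch of the read/write idea this statement attributes to [Na et al., 2017]: multimodal slots are written into memory through a small stack of 1D convolutions, and reading convolves the memory again before a question-driven soft attention. Layer counts, kernel sizes, and fusion here are assumptions for illustration, not the cited architecture.

```python
# Hypothetical convolutional read/write memory, assumed structure throughout.
import torch
import torch.nn as nn

class ConvReadWriteMemorySketch(nn.Module):
    def __init__(self, slot_dim=512, hid_dim=512):
        super().__init__()
        # Write network: convolve over the sequence of multimodal slots.
        self.write_net = nn.Sequential(
            nn.Conv1d(slot_dim, hid_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hid_dim, hid_dim, kernel_size=3, padding=1), nn.ReLU())
        # Read network: convolve the written memory before attention.
        self.read_net = nn.Sequential(
            nn.Conv1d(hid_dim, hid_dim, kernel_size=3, padding=1), nn.ReLU())

    def forward(self, slots, question):
        # slots: (1, N, slot_dim) fused subtitle+frame embeddings; question: (1, hid_dim)
        memory = self.write_net(slots.transpose(1, 2))              # (1, hid_dim, N)
        readable = self.read_net(memory).transpose(1, 2)            # (1, N, hid_dim)
        weights = torch.softmax(readable @ question.unsqueeze(-1), dim=1)  # (1, N, 1)
        return (weights * readable).sum(dim=1)                      # (1, hid_dim) read vector

sketch = ConvReadWriteMemorySketch()
print(sketch(torch.randn(1, 20, 512), torch.randn(1, 512)).shape)  # torch.Size([1, 512])
```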
“…Compared with typical multimodal VQA, the problem of the cross-modal gap becomes more serious in the context of multimodal MovieQA. Firstly, different movies may have different backgrounds, themes, and shooting styles, which makes it difficult to learn robust multimodal representations [Wang et al., 2018a]. Secondly, it is common that visual clips and textual subtitles are not aligned along the time axis.…”
Section: Introduction
confidence: 99%