2018
DOI: 10.48550/arxiv.1804.09412
Preprint

Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents

Abstract: Movies provide us with a mass of visual content as well as attractive stories. Existing methods have shown that understanding movie stories through visual content alone is still a hard problem. In this paper, for answering questions about movies, we put forward a Layered Memory Network (LMN) that represents frame-level and clip-level movie content by the Static Word Memory module and the Dynamic Subtitle Memory module, respectively. In particular, we first extract words and sentences from the training mo…
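A minimal sketch of the layered attention idea described in the abstract: frame-level visual features read from a static word memory, clip-level features read from a dynamic subtitle memory, and the fused movie representation scores each candidate answer. The class name, dimensions, fusion, and scoring below are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of the layered memory idea, not the paper's architecture.
import torch
import torch.nn as nn

class LayeredMemorySketch(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=300, hid_dim=512):
        super().__init__()
        self.frame_proj = nn.Linear(vis_dim, hid_dim)   # frame-level visual features
        self.clip_proj = nn.Linear(vis_dim, hid_dim)    # clip-level visual features
        self.word_proj = nn.Linear(txt_dim, hid_dim)    # static word memory slots
        self.subt_proj = nn.Linear(txt_dim, hid_dim)    # dynamic subtitle memory slots
        self.ans_proj = nn.Linear(txt_dim, hid_dim)     # candidate QA encodings

    def attend(self, query, memory):
        # Soft attention: each query vector reads a weighted sum of memory slots.
        weights = torch.softmax(query @ memory.transpose(-1, -2), dim=-1)
        return weights @ memory

    def forward(self, frames, clips, words, subtitles, qa):
        # frames: (T, vis_dim), clips: (C, vis_dim),
        # words: (W, txt_dim), subtitles: (S, txt_dim), qa: (A, txt_dim) candidates
        frame_read = self.attend(self.frame_proj(frames), self.word_proj(words))
        clip_read = self.attend(self.clip_proj(clips), self.subt_proj(subtitles))
        movie_repr = frame_read.mean(0) + clip_read.mean(0)  # simple fusion (assumption)
        return self.ans_proj(qa) @ movie_repr                # one score per candidate

# Usage with random tensors standing in for extracted features.
model = LayeredMemorySketch()
scores = model(torch.randn(40, 2048), torch.randn(8, 2048),
               torch.randn(100, 300), torch.randn(30, 300), torch.randn(5, 300))
print(scores.shape)  # torch.Size([5])
```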

Cited by 2 publications (18 citation statements) | References 25 publications
“…Video QA Methods. Extensive studies have been conducted for video QA [61,12,17,23,24,31,49,11,25,60,46,51,8,33,63,66]. Yu et al. [61] employed an LSTM to encode videos and QA pairs, and adopted an attention mechanism [58].…”
Section: Related Work
confidence: 99%
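A minimal sketch of the recipe this statement attributes to Yu et al. [61]: one LSTM encodes the frame sequence, another encodes the QA pair, and a QA-driven soft attention pools the video states before scoring. Module names, dimensions, and the scoring head are assumptions for illustration only.

```python
# Hypothetical LSTM + attention video QA scorer, assumed details throughout.
import torch
import torch.nn as nn

class LSTMAttentionQA(nn.Module):
    def __init__(self, vis_dim=2048, emb_dim=300, hid_dim=512):
        super().__init__()
        self.video_lstm = nn.LSTM(vis_dim, hid_dim, batch_first=True)
        self.text_lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.att = nn.Linear(hid_dim, hid_dim)
        self.score = nn.Linear(2 * hid_dim, 1)

    def forward(self, frames, qa_tokens):
        # frames: (1, T, vis_dim); qa_tokens: (1, L, emb_dim) for one candidate answer
        video_states, _ = self.video_lstm(frames)        # (1, T, hid_dim)
        _, (qa_state, _) = self.text_lstm(qa_tokens)     # final hidden state of QA pair
        qa_vec = qa_state[-1]                            # (1, hid_dim)
        weights = torch.softmax(
            (video_states @ self.att(qa_vec).unsqueeze(-1)).squeeze(-1), dim=-1)
        attended = (weights.unsqueeze(-1) * video_states).sum(dim=1)   # pooled video
        return self.score(torch.cat([attended, qa_vec], dim=-1))       # answer score

model = LSTMAttentionQA()
print(model(torch.randn(1, 40, 2048), torch.randn(1, 12, 300)).shape)  # torch.Size([1, 1])
```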
“…[Na et al., 2017] provides a read-write memory network where the read network and write network consist of multiple convolutional layers, which enables memory read and write operations to have high capacity and flexibility. [Wang et al., 2018a] focuses on the video representation, and puts forward a layered memory network that represents frame-level and clip-level movie content by a static word memory module and a dynamic subtitle memory module, respectively.…”
Section: Multimodal Question Answering
confidence: 99%
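A minimal sketch of the read/write idea this statement attributes to [Na et al., 2017]: multimodal slots are written into memory through a small stack of 1D convolutions, and reading convolves the memory again before a question-driven soft attention. Layer counts, kernel sizes, and fusion here are assumptions for illustration, not the cited architecture.

```python
# Hypothetical convolutional read/write memory, assumed structure throughout.
import torch
import torch.nn as nn

class ConvReadWriteMemorySketch(nn.Module):
    def __init__(self, slot_dim=512, hid_dim=512):
        super().__init__()
        # Write network: convolve over the sequence of multimodal slots.
        self.write_net = nn.Sequential(
            nn.Conv1d(slot_dim, hid_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hid_dim, hid_dim, kernel_size=3, padding=1), nn.ReLU())
        # Read network: convolve the written memory before attention.
        self.read_net = nn.Sequential(
            nn.Conv1d(hid_dim, hid_dim, kernel_size=3, padding=1), nn.ReLU())

    def forward(self, slots, question):
        # slots: (1, N, slot_dim) fused subtitle+frame embeddings; question: (1, hid_dim)
        memory = self.write_net(slots.transpose(1, 2))              # (1, hid_dim, N)
        readable = self.read_net(memory).transpose(1, 2)            # (1, N, hid_dim)
        weights = torch.softmax(readable @ question.unsqueeze(-1), dim=1)  # (1, N, 1)
        return (weights * readable).sum(dim=1)                      # (1, hid_dim) read vector

sketch = ConvReadWriteMemorySketch()
print(sketch(torch.randn(1, 20, 512), torch.randn(1, 512)).shape)  # torch.Size([1, 512])
```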
“…Compared with typical multimodal VQA, the problem of the cross-modal gap becomes more serious in the context of multimodal MovieQA. Firstly, different movies may have different backgrounds, themes, and shooting styles, which makes it difficult to learn robust multimodal representations [Wang et al., 2018a]. Secondly, it is common that visual clips and textual subtitles are not aligned along the time axis.…”
Section: Introduction
confidence: 99%