2017
DOI: 10.1609/aaai.v31i1.11238

Leveraging Video Descriptions to Learn Video Question Answering

Abstract: We propose a scalable approach to learn video-based question answering (QA): to answer a free-form natural language question about the contents of a video. Our approach automatically harvests a large number of videos and descriptions freely available online. Then, a large number of candidate QA pairs are automatically generated from descriptions rather than manually annotated. Next, we use these candidate QA pairs to train a number of video-based QA methods extended from MN (Sukhbaatar et al. 2015), VQA (Antol…
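As a reading aid, the abstract's second step, turning harvested descriptions into candidate QA pairs automatically, can be pictured with the toy sketch below. The paper generates QA pairs with an automatic question-generation method rather than hand-written rules, so the templates, regular expression, and function name here are illustrative assumptions only, shown in Python for concreteness.

```python
import re

def generate_candidate_qa(description):
    """Turn one video description into candidate (question, answer) pairs
    with naive templates. Illustrative only; the paper relies on an
    automatic question-generation system, not hand-written rules."""
    pairs = []

    # Match sentences of the form "a/an/the <noun> is/are <verb phrase>".
    m = re.match(r"(?i)^(a|an|the) (\w+) (is|are) (.+)$",
                 description.strip().rstrip("."))
    if m:
        subject = f"{m.group(1).lower()} {m.group(2)}"
        rest = f"{m.group(3)} {m.group(4)}"
        # "Who ...?" question: mask the subject, answer with the subject.
        pairs.append((f"Who {rest}?", subject))
        # "What is X doing?" question: answer with the verb phrase.
        pairs.append((f"What {m.group(3)} {subject} doing?", m.group(4)))
    return pairs

if __name__ == "__main__":
    for q, a in generate_candidate_qa("A man is playing the guitar on the street."):
        print(f"Q: {q}\n A: {a}")
```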

Cited by 76 publications (19 citation statements) · References 16 publications
“…Canonical approaches use techniques such as cross-modal attention (Jang et al. 2017; Zeng et al. 2017; Li et al. 2019b; Jin et al. 2019; Gao et al. 2019) and motion-appearance memory (Xu et al. 2017; Gao et al. 2018; Fan et al. 2019) to fuse information from the video and question for answer prediction. These methods focus on designing sophisticated cross-modal interactions while treating the video and the question as holistic sequences of frames and words, respectively.…”
Section: Related Work
confidence: 99%
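To make the quoted idea concrete, the sketch below shows a generic question-guided attention over per-frame features followed by answer classification, written in PyTorch. It is a minimal stand-in for the cross-modal attention family mentioned above, not a reimplementation of any cited model; the module name, feature dimensions, and answer-vocabulary size are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedFrameAttention(nn.Module):
    """Minimal cross-modal attention: a question vector attends over
    per-frame features, and the fused vector scores candidate answers.
    Illustrative only; not a reimplementation of any cited method."""

    def __init__(self, frame_dim=2048, q_dim=512, hidden=512, num_answers=1000):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden + q_dim, num_answers)

    def forward(self, frames, question):
        # frames: (B, T, frame_dim) per-frame CNN features
        # question: (B, q_dim) encoded question
        f = self.frame_proj(frames)                 # (B, T, hidden)
        q = self.q_proj(question).unsqueeze(1)      # (B, 1, hidden)
        attn = F.softmax(self.score(torch.tanh(f + q)).squeeze(-1), dim=1)  # (B, T)
        video = torch.bmm(attn.unsqueeze(1), f).squeeze(1)                  # (B, hidden)
        return self.classifier(torch.cat([video, question], dim=-1))        # (B, num_answers)

# Example: 8 frames of 2048-d features and a 512-d question vector.
model = QuestionGuidedFrameAttention()
logits = model(torch.randn(2, 8, 2048), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 1000])
```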
“…On top of these representations, the decoder learns the visual-linguistic alignments to generate the answer. In particular, the alignments are modeled via cross-modal interactions such as graph alignment [19], cross-attention [14, 15, 18, 39], and co-memory [10].…”
Section: Preliminaries
confidence: 99%
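The memory-style alignment mentioned here, and the MN baseline (Sukhbaatar et al. 2015) named in the abstract, can likewise be pictured with a minimal multi-hop memory sketch in PyTorch. The hop count, projections, and answer head are illustrative assumptions rather than the cited implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoMemoryQA(nn.Module):
    """Minimal multi-hop memory module in the spirit of end-to-end memory
    networks (Sukhbaatar et al. 2015), applied to per-frame features.
    Dimensions, hop count, and the answer head are assumptions."""

    def __init__(self, frame_dim=2048, q_dim=512, hidden=512, hops=2, num_answers=1000):
        super().__init__()
        self.key_proj = nn.Linear(frame_dim, hidden)  # addressing memory
        self.val_proj = nn.Linear(frame_dim, hidden)  # content memory
        self.q_proj = nn.Linear(q_dim, hidden)
        self.hops = hops
        self.answer = nn.Linear(hidden, num_answers)

    def forward(self, frames, question):
        # frames: (B, T, frame_dim); question: (B, q_dim)
        keys, vals = self.key_proj(frames), self.val_proj(frames)   # (B, T, hidden)
        u = self.q_proj(question)                                   # (B, hidden)
        for _ in range(self.hops):
            p = F.softmax(torch.bmm(keys, u.unsqueeze(-1)).squeeze(-1), dim=1)  # (B, T)
            o = torch.bmm(p.unsqueeze(1), vals).squeeze(1)                      # (B, hidden)
            u = u + o                                # refine the query at each hop
        return self.answer(u)                        # (B, num_answers)

logits = VideoMemoryQA()(torch.randn(2, 8, 2048), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 1000])
```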
“…As videos can be taken as spatio-temporal extensions of images, how to incorporate temporal cues into the video representation and associate them with textual cues in the question is crucial to video question answering (Zeng et al. 2017). As videos are more complex than images, constructing datasets to drive research on video question answering is a challenging task; examples include TGIF-QA (Jang et al. 2017), MarioQA (Mun et al. 2017), the 'fill-in-the-blank' setting, and the large-scale video question answering dataset built without manual annotations (Zeng et al. 2017). Recently, Tapaswi et al. (2016) proposed the MovieQA dataset with multiple sources for movie question answering, which attracted follow-up work such as video-story learning (Kim et al. 2017) and multi-modal movie question answering (Na et al. 2017).…”
Section: Video Question Answering
confidence: 99%