2017
DOI: 10.1109/tip.2017.2746267
Unifying the Video and Question Attentions for Open-Ended Video Question Answering

Abstract: Video question answering is an important task toward scene understanding and visual data retrieval. However, current visual question answering works mainly focus on a single static image, which is distinct from the dynamic and sequential visual data in the real world. Their approaches cannot utilize the temporal information in videos. In this paper, we introduce the task of free-form open-ended video question answering. The open-ended answers enable wider applications compared with the common multiple-choice t…


Cited by 57 publications (22 citation statements)
References 17 publications
“…Lee et al [22] propose the Stacked Cross Attention Network (SCAN), which discovers cross-modal alignments via a fine-grained attention scheme over regions in an image and words in a sentence. Beyond fundamental image-text matching, there are more emerging and attractive applications related to visual-semantic embedding, such as image captioning [42], [33], [27], [2] and visual question answering [3], [28], [41], [38], [44]. Anderson et al […] Unlike them, the fine-grained problem is the major difficulty in distinguishing different people in description-based person Re-id, which needs to be carefully addressed.…”
Section: Related Work a Visual-semantic Embeddingmentioning
confidence: 99%
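The fine-grained region-word attention attributed to SCAN above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the feature dimensions, the cosine-normalised similarity, and the softmax temperature are all illustrative assumptions.

```python
import numpy as np

def cross_attention(regions, words, temperature=4.0):
    """Attend each word over all image regions (SCAN-style sketch).

    regions: (R, d) region features; words: (W, d) word features.
    Returns a (W, d) region context per word and the (W, R) attention map.
    """
    # Cosine-normalise both modalities before computing similarities.
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    sim = w @ r.T                          # (W, R) word-region similarities
    att = np.exp(temperature * sim)
    att /= att.sum(axis=1, keepdims=True)  # softmax over regions, per word
    return att @ regions, att              # weighted region context per word

rng = np.random.default_rng(0)
ctx, att = cross_attention(rng.normal(size=(36, 8)), rng.normal(size=(5, 8)))
```

Each word thus receives its own weighted summary of the image regions, which is what makes the alignment "fine-grained" rather than a single global image-sentence score.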
“…Xue et al [83] also created a new dataset using the TGIF video captioning dataset, but theirs is designed to capture open-ended question answers.…”
Section: ) Encoder-decoder Based Methodsmentioning
confidence: 99%
“…The MSVD-QA dataset [83] is based on the MSVD video captioning dataset and utilizes the video captions to automatically generate questions of the type "what, who, how, when and where".…”
Section: Datasetmentioning
confidence: 99%
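The caption-to-question idea behind MSVD-QA can be illustrated with a toy template rule. This is a hypothetical sketch only: the function name, the subject list, and the single "is"-based template are my assumptions; the actual dataset was generated with a far more capable automatic QA-generation tool.

```python
def caption_to_question(caption):
    """Toy rule: split a caption on its copula and build a wh-question.

    Returns (question, answer) or None if the template does not apply.
    """
    subject, _, rest = caption.partition(" is ")
    if not rest:
        return None  # template only handles "<subject> is <predicate>"
    predicate = rest.rstrip(".")
    # Person-like subjects yield "who" questions; everything else "what".
    if subject.lower() in {"a man", "a woman", "someone", "a person"}:
        return "who is " + predicate + "?", subject
    return "what is " + predicate + "?", subject

q, a = caption_to_question("a man is playing a guitar.")
# q == "who is playing a guitar?", a == "a man"
```

The caption's subject becomes the ground-truth answer, which is why such datasets can be built without manual annotation.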
“…Generally, successes and advances in video/image captioning and attention mechanisms provide new research directions for the VideoQA task. An encoder-decoder based approach is proposed in [11], where the attentions are unified by considering both the question sentence and the video. Frame-based visual attributes and question-sentence-based textual attributes are jointly learned in the approach proposed in [12].…”
Section: Related Workmentioning
confidence: 99%
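The question-conditioned video attention described above can be sketched in a few lines. Assumptions are mine throughout: the frame/question feature dimensions, the dot-product scoring, and the scaling factor are illustrative, not the encoder-decoder design of [11].

```python
import numpy as np

def question_guided_frame_attention(frames, question_vec):
    """Attend over video frames conditioned on a question encoding.

    frames: (T, d) per-frame features; question_vec: (d,) question encoding.
    Returns a (d,) question-aware video summary and the (T,) weights.
    """
    # Scaled dot-product scores between each frame and the question.
    scores = frames @ question_vec / np.sqrt(frames.shape[1])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()                 # normalise over frames
    return weights @ frames, weights         # question-aware video summary

rng = np.random.default_rng(1)
summary, w = question_guided_frame_attention(
    rng.normal(size=(20, 16)), rng.normal(size=16))
```

Because the weights depend on the question, different questions about the same clip pick out different frames — the temporal selectivity that single-image VQA models lack.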