Proceedings of the 27th ACM International Conference on Multimedia 2019
DOI: 10.1145/3343031.3350969
Question-Aware Tube-Switch Network for Video Question Answering

Cited by 24 publications (9 citation statements) | References 24 publications
“…Developments along this direction include attribute-based attention [42], hierarchical attention [21,45,46], multi-head attention [14,19], multi-step progressive attention memory [12], and combinations of self-attention with co-attention [20]. For higher-order reasoning, the question can interact iteratively with video features via episodic memory or through a switching mechanism [41]. Multi-step reasoning for VideoQA is also approached by [39] and [30] with refined attention.…”
Section: Related Work (mentioning, confidence: 99%)
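The switching mechanism of [41] (the paper indexed here) is only named in this excerpt, not specified. As a rough illustration of the general idea, a question-conditioned soft switch that mixes an appearance tube and a motion tube could be sketched as follows; the module names, gating form, and dimensions are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SoftTubeSwitch(nn.Module):
    """Hypothetical question-conditioned switch between two feature tubes.

    The question vector produces a scalar gate that mixes the appearance
    and motion representations. Illustrative sketch only; not the
    architecture from the paper.
    """
    def __init__(self, q_dim: int, v_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(q_dim, v_dim), nn.Tanh(),
                                  nn.Linear(v_dim, 1), nn.Sigmoid())

    def forward(self, q: torch.Tensor, v_app: torch.Tensor, v_mot: torch.Tensor):
        # q: (batch, q_dim); v_app, v_mot: (batch, v_dim)
        alpha = self.gate(q)                      # (batch, 1), in (0, 1)
        return alpha * v_app + (1 - alpha) * v_mot

# Example: mix 2048-d appearance and motion features under a 512-d question
switch = SoftTubeSwitch(q_dim=512, v_dim=2048)
mixed = switch(torch.randn(4, 512), torch.randn(4, 2048), torch.randn(4, 2048))
```

A soft (sigmoid) gate keeps the switch differentiable end to end; a hard, discrete switch would instead require sampling or a straight-through estimator.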
“…Video Question Answering aims to answer a given question about video content. Most current works [1]-[3], [10], [17]-[19] extract holistic visual appearance and motion features to represent video content and design attention mechanisms, such as question-guided attention [1], [11] and co-attention [3], [19], to integrate these features. These methods focus on holistic understanding of video content and may neglect the meaningful, fine-grained content that complicated semantic questions concern.…”
Section: A. Video Question Answering (mentioning, confidence: 99%)
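The excerpt above contrasts question-guided attention with co-attention. A minimal sketch of question-guided temporal attention, in which a question vector scores each frame feature, might look like the following; the additive scoring form, hidden size, and names are illustrative assumptions, not the formulation of any specific cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    """Attend over T frame features using a question vector as the query.

    Generic sketch of the mechanism described in the excerpt.
    """
    def __init__(self, q_dim: int, v_dim: int, hidden: int = 512):
        super().__init__()
        self.proj_q = nn.Linear(q_dim, hidden)
        self.proj_v = nn.Linear(v_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, q: torch.Tensor, v: torch.Tensor):
        # q: (batch, q_dim); v: (batch, T, v_dim)
        h = torch.tanh(self.proj_q(q).unsqueeze(1) + self.proj_v(v))  # (batch, T, hidden)
        attn = F.softmax(self.score(h).squeeze(-1), dim=1)            # (batch, T)
        return torch.bmm(attn.unsqueeze(1), v).squeeze(1)             # (batch, v_dim)

# Example: 20 frames of 2048-d appearance features, 512-d question vector
att = QuestionGuidedAttention(q_dim=512, v_dim=2048)
ctx = att(torch.randn(4, 512), torch.randn(4, 20, 2048))
```

Co-attention, by contrast, would additionally let the attended video summary re-weight the question words, so that each modality guides the other.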
“…For fair comparison with the state-of-the-art methods (in Section IV) [15], $V_i^a$ are the pool5 outputs of ResNet101 and $V_i^m$ are extracted from ResNeXt-101. Following previous work [15,42], we embed the words of each question into fixed-length vectors initialized with 300-dimensional GloVe [43] as $W = \{w_i : 1 \le i \le L_q,\, w_i \in \mathbb{R}^{300 \times 1}\}$, where $L_q$ is the length of the question. We then pass these word embeddings through a BiLSTM network to obtain context-aware embedding vectors.…”
Section: A. Visual and Linguistic Representation (mentioning, confidence: 99%)
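The question-encoding pipeline this excerpt describes (300-d GloVe initialization followed by a BiLSTM producing context-aware vectors) can be sketched roughly as below; the random embedding initialization stands in for pretrained GloVe vectors, and every size other than the stated 300-d embedding dimension is an assumption.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Embed question words and contextualize them with a BiLSTM.

    Sketch of the pipeline in the excerpt: 300-d word embeddings
    (GloVe-initialized in the paper; random here) fed to a BiLSTM.
    The hidden size is an assumption.
    """
    def __init__(self, vocab_size: int, emb_dim: int = 300, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # load GloVe weights in practice
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, L_q) word indices
        w = self.embed(tokens)    # (batch, L_q, 300) word embeddings
        ctx, _ = self.bilstm(w)   # (batch, L_q, 2*hidden) context-aware vectors
        return ctx

# Example: a batch of 4 questions, each padded to length L_q = 12
enc = QuestionEncoder(vocab_size=10000)
ctx = enc(torch.randint(0, 10000, (4, 12)))
```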
“…We compare our proposed model with state-of-the-art (SOTA) methods on the aforementioned datasets. For MSVD-QA and MSRVTT-QA, we compare with the most recent SOTA, including HME [51], HGA [17], HCRN [15], and TSN [42].…”
Section: Comparison With the State-of-the-Art (mentioning, confidence: 99%)