Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475193
Progressive Graph Attention Network for Video Question Answering

Cited by 28 publications (13 citation statements)
References 24 publications
“…MERLOT [70]) by about 1.0%, yet without using external data for cross-modal pretraining. On TGIF-QA-R [42], which is curated by making the negative answers in TGIF-QA more challenging, we also observe remarkable improvements. Besides, VGT also achieves competitive results on the normal descriptive QA tasks defined in FrameQA and MSRVTT-QA, though they are not our focus.…”
Section: State-of-the-art Comparison
confidence: 85%
“…Recently, graphs constructed over object-level representations [19,36,47,60] have demonstrated superior performance, especially on benchmarks that emphasize visual relation reasoning [20,49,50,59]. However, these graph methods either construct monolithic graphs that do not disambiguate between relations in 1) space and time and 2) local and global scopes [19,57], or build static graphs at the frame level without explicitly capturing the temporal dynamics [36,42,60]. The monolithic graph is cumbersome for long videos where multiple objects interact in space-time.…”
Section: Related Work
confidence: 99%
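The contrast this statement draws with "monolithic" graphs can be made concrete with a small sketch. The PyTorch module below treats every detected object in every frame as a node of one fully connected attention graph, so spatial and temporal relations are mixed in a single adjacency, which is the setup the quote criticizes. The module name, single-head attention form, and all dimensions are illustrative assumptions, not code from the paper or any cited work.

# Minimal sketch (PyTorch) of a monolithic space-time object graph.
# All names and sizes are hypothetical, for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonolithicSpaceTimeGraph(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, obj_feats):
        # obj_feats: (batch, num_frames * num_objects, dim) -- every object in
        # every frame is a node; one attention pass mixes spatial and temporal
        # relations instead of disambiguating them.
        q, k, v = self.query(obj_feats), self.key(obj_feats), self.value(obj_feats)
        attn = F.softmax(q @ k.transpose(-2, -1) / obj_feats.size(-1) ** 0.5, dim=-1)
        return obj_feats + attn @ v  # residual update of node features

# Example: 8 frames x 5 objects = 40 nodes per clip, 256-d features.
x = torch.randn(2, 8 * 5, 256)
out = MonolithicSpaceTimeGraph(256)(x)
print(out.shape)  # torch.Size([2, 40, 256])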
“…In contrast, although concurrent work (HGA) by [Jiang and Han, 2020], and more recent works, B2A by [Park et al, 2021] and DualVGR by build the graph based on more coarse-grained video elements and words, they incorporate both intra-modal and inter-modal relationship learning and achieve better performance. Considering that video elements are hierarchical in semantic space, [Liu et al, 2021a], [Peng et al, 2021] and separately incorporate the hierarchical learning idea into graph networks. Specifically, [Liu et al, 2021a] propose a graph memory mechanism (HAIR) to perform relational vision-semantic reasoning from the object level to the frame level; [Peng et al, 2021] concatenate different-level graphs, that is, object-level, frame-level and clip-level, in a progressive manner to learn the visual relations (PGAT); while propose a hierarchical conditional graph model (HQGA) to weave together visual facts from low-level entities to higher-level video elements through graph aggregation and pooling, to enable vision-text matching at multi-granularity levels.…”
Section: Methods
confidence: 99%
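To make the hierarchical, object-to-frame-to-clip idea summarized above (HAIR, PGAT, HQGA) more tangible, here is a hedged sketch of progressive graph aggregation. The two-level structure, the mean pooling between levels, and all class names are assumptions chosen for brevity; none of the cited models is implemented exactly like this.

# Hedged sketch (PyTorch) of hierarchical graph aggregation:
# object-level graph -> pool to frame nodes -> frame-level graph -> pool to clip.
# Names, pooling choices, and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    """Single-head self-attention over a fully connected graph of nodes."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)

    def forward(self, nodes):  # nodes: (batch, n, dim)
        q, k, v = self.qkv(nodes).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / nodes.size(-1) ** 0.5, dim=-1)
        return nodes + attn @ v

class HierarchicalVideoGraph(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.object_graph = GraphAttention(dim)  # relations among objects within a frame
        self.frame_graph = GraphAttention(dim)   # temporal relations among frames

    def forward(self, obj_feats):
        # obj_feats: (batch, frames, objects, dim)
        b, t, o, d = obj_feats.shape
        objs = self.object_graph(obj_feats.view(b * t, o, d))  # per-frame object graph
        frames = objs.view(b, t, o, d).mean(dim=2)             # pool objects -> frame nodes
        frames = self.frame_graph(frames)                      # frame-level graph
        clip = frames.mean(dim=1)                              # pool frames -> clip vector
        return clip                                            # (batch, dim), e.g. for QA matching

x = torch.randn(2, 8, 5, 256)                    # 8 frames, 5 objects per frame
print(HierarchicalVideoGraph(256)(x).shape)      # torch.Size([2, 256])

The point of the progression is that each level reasons only over nodes of a single granularity, so object-object, frame-frame, and clip-level relations are kept separate rather than folded into one monolithic graph.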