“…In contrast, although the concurrent work HGA [Jiang and Han, 2020] and the more recent B2A [Park et al, 2021] and DualVGR build their graphs over coarser-grained video elements and words, they incorporate both intra-modal and inter-modal relationship learning and achieve better performance. Considering that video elements are hierarchical in semantic space, several works, including [Liu et al, 2021a] and [Peng et al, 2021], separately incorporate the idea of hierarchical learning into graph networks. Specifically, [Liu et al, 2021a] propose a graph memory mechanism (HAIR) to perform relational vision-semantic reasoning from the object level to the frame level; [Peng et al, 2021] concatenate graphs at different levels, i.e., object-level, frame-level, and clip-level, in a progressive manner to learn visual relations (PGAT); and HQGA, a hierarchical conditional graph model, weaves together visual facts from low-level entities to higher-level video elements through graph aggregation and pooling, enabling vision-text matching at multiple granularity levels.…”
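To make the hierarchical idea concrete, the sketch below (in PyTorch) illustrates the general aggregate-then-pool pattern these models share: nodes at one granularity are related by a small graph layer, pooled into fewer higher-level nodes, and conditioned on the question at each level. Every module name, dimension, and the gating-based question conditioning here is an illustrative assumption, not the published implementation of HAIR, PGAT, or HQGA.

```python
# Minimal sketch (not the authors' code) of hierarchical graph
# aggregation and pooling for VideoQA: relate object nodes within each
# frame, pool them into frame nodes, relate frames, pool to a video
# vector. All names, sizes, and the gating scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph-convolution step with a similarity-based adjacency."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, N, D)
        # Dense, row-normalized adjacency from pairwise similarity;
        # papers often use attention or k-NN graphs instead.
        adj = F.softmax(x @ x.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        return F.relu(adj @ self.proj(x))

class HierarchicalGraphVQA(nn.Module):
    """Object-level graph -> frame nodes -> frame-level graph -> video vector."""
    def __init__(self, dim=256):
        super().__init__()
        self.obj_gcn = GCNLayer(dim)
        self.frame_gcn = GCNLayer(dim)
        self.q_gate = nn.Linear(dim, dim)  # question-conditioned gating

    def condition(self, x, q):
        # Gate node features by the question embedding (an assumption;
        # the cited models use more elaborate conditioning).
        return x * torch.sigmoid(self.q_gate(q)).unsqueeze(1)

    def forward(self, objs, q):
        # objs: (B, T, K, D) = K object features per frame for T frames
        # q:    (B, D)      = pooled question embedding
        B, T, K, D = objs.shape
        o = self.condition(objs.view(B * T, K, D), q.repeat_interleave(T, 0))
        o = self.obj_gcn(o)                # reason among objects in a frame
        frames = o.mean(1).view(B, T, D)   # pool objects -> frame nodes
        f = self.frame_gcn(self.condition(frames, q))  # reason across frames
        return f.mean(1)                   # pool frames -> video vector

# Usage: the video vector can then be matched against answer embeddings.
model = HierarchicalGraphVQA()
video_vec = model(torch.randn(2, 8, 5, 256), torch.randn(2, 256))
print(video_vec.shape)  # torch.Size([2, 256])
```

Mean pooling is used here only for brevity; the works above employ learned aggregation (e.g., attention-based pooling) and richer graph constructions at each level.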