Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3414026

Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization

Abstract: Query-based moment localization is a new task that localizes the best-matched segment in an untrimmed video according to a given sentence query. In this localization task, one should pay more attention to thoroughly mining visual and linguistic information. To this end, we propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative message passing over a joint graph. Specifically, the joint graph consists of a Cross-Modal relation Graph (CMG) and a Self-Modal relation Graph (SMG).
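The abstract only describes the architecture at a high level. As a rough illustration of the idea of iterative message passing over a joint cross-modal and self-modal graph, the Python/NumPy sketch below alternates cross-modal attention (video clips attending to words and vice versa) with self-modal attention within each modality. The function names, feature dimensions, and single-head scaled dot-product formulation are illustrative assumptions, not the authors' implementation.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Scaled dot-product attention: every query node aggregates a
    # weighted message from all key/value nodes.
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values

def joint_graph_step(video_nodes, word_nodes):
    # One hypothetical round of message passing on the joint graph:
    # cross-modal attention first, then self-modal attention.
    v_cross = attend(video_nodes, word_nodes, word_nodes)   # clips attend to words
    w_cross = attend(word_nodes, video_nodes, video_nodes)  # words attend to clips
    v_self = attend(v_cross, v_cross, v_cross)              # clip-to-clip relations
    w_self = attend(w_cross, w_cross, w_cross)              # word-to-word relations
    return v_self, w_self

# Toy usage: 16 clip features and 8 word features, 64-d each.
rng = np.random.default_rng(0)
video = rng.normal(size=(16, 64))
words = rng.normal(size=(8, 64))
for _ in range(2):                  # a few message-passing iterations
    video, words = joint_graph_step(video, words)
print(video.shape, words.shape)     # (16, 64) (8, 64)

In the paper the updated multi-modal representations are then used to score candidate moments; that stage is not sketched here.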

Cited by 105 publications (62 citation statements)
References 44 publications (92 reference statements)

“…Besides, the performance of ExCL decreases when the metric becomes stricter. Even compared to the newly proposed two-stage methods [15,21], our model is competitive. In short, all these experiments show the effectiveness of the proposed method.…”
Section: Comparison With State-of-the-Art Methods
confidence: 91%
“…The following works [12,15,22] mainly focus on constructing a better interaction model between candidates and the query sentence. Jiang et al. take advantage of object-level features to mine specific details in videos.…”
Section: Temporal Moment Localization
confidence: 99%
“…Qu et al. [28] proposed an iterative attention module to excavate the grounding clues from both visual and textual modalities. Liu et al. [29] reformulated this task as an iterative message passing process over a joint graph that consists of cross-modal and self-modal relation graphs. Although these methods have achieved good results, they are seriously limited by the quality of candidate proposals and by computing cost.…”
Section: Temporal Sentence Grounding
confidence: 99%
“…As most videos contain activities of interest mixed with complicated background content, these videos cannot be directly indicated by a pre-defined list of action classes. Recently, a new task called temporal sentence localization in videos (Gao et al., 2017; Anne Hendricks et al., 2017) was proposed to tackle this problem, attracting great interest from both the vision and language communities (Liu et al., 2020). Given an untrimmed video, this task aims to infer the start and end timestamps of a target video segment that contains the activity of interest according to a given sentence query.…”
Section: Introduction
confidence: 99%