Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3414026

Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization

Abstract: Query-based moment localization is a new task that localizes the best-matched segment in an untrimmed video according to a given sentence query. In this localization task, one should pay more attention to thoroughly mining visual and linguistic information. To this end, we propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative message passing over a joint graph. Specifically, the joint graph consists of a Cross-Modal relation Graph (CMG) and a Self-Modal relation Graph (SMG).
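The abstract only describes the architecture at a high level. As a rough illustration of the idea of iterative message passing over a joint cross-modal and self-modal graph, the Python/NumPy sketch below alternates cross-modal attention (video clips attending to words and vice versa) with self-modal attention within each modality. The function names, feature dimensions, and single-head scaled dot-product formulation are illustrative assumptions, not the authors' implementation.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Scaled dot-product attention: every query node aggregates a
    # weighted message from all key/value nodes.
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values

def joint_graph_step(video_nodes, word_nodes):
    # One hypothetical round of message passing on the joint graph:
    # cross-modal attention first, then self-modal attention.
    v_cross = attend(video_nodes, word_nodes, word_nodes)   # clips attend to words
    w_cross = attend(word_nodes, video_nodes, video_nodes)  # words attend to clips
    v_self = attend(v_cross, v_cross, v_cross)              # clip-to-clip relations
    w_self = attend(w_cross, w_cross, w_cross)              # word-to-word relations
    return v_self, w_self

# Toy usage: 16 clip features and 8 word features, 64-d each.
rng = np.random.default_rng(0)
video = rng.normal(size=(16, 64))
words = rng.normal(size=(8, 64))
for _ in range(2):                  # a few message-passing iterations
    video, words = joint_graph_step(video, words)
print(video.shape, words.shape)     # (16, 64) (8, 64)

In the paper the updated multi-modal representations are then used to score candidate moments; that stage is not sketched here.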

Cited by 105 publications (62 citation statements)
References 44 publications (92 reference statements)

“…Besides, the performance of ExCL decreases when the metric becomes stricter. Even compared to the newly proposed two-stage methods [15,21], our model is competitive. In short, all these experiments show the effectiveness of the proposed method.…”
Section: Comparison With State-of-the-Art Methods
confidence: 91%
“…The following works [12,15,22] mainly focus on constructing a better interaction model between candidates and the query sentence. Jiang et al. take advantage of object-level features to mine specific details in videos.…”
Section: Temporal Moment Localization
confidence: 99%
“…Qu et al. [28] proposed an iterative attention module to excavate the grounding clues from both visual and textual modalities. Liu et al. [29] reformulated this task as an iterative message passing process over a joint graph that consists of cross-modal and self-modal relation graphs. Although these methods have achieved good results, they are seriously limited by the quality of candidate proposals and by computing cost.…”
Section: Temporal Sentence Grounding
confidence: 99%
“…As most videos contain activities of interest mixed with complicated background content, these videos cannot be directly indicated by a pre-defined list of action classes. Recently, a new task called temporal sentence localization in videos (Gao et al., 2017; Anne Hendricks et al., 2017) was proposed to tackle this problem, attracting great interest from both the vision and language communities (Liu et al., 2020). Given an untrimmed video, this task aims to infer the start and end timestamps of a target video segment that contains the activity of interest according to a given sentence query.…”
Section: Introduction
confidence: 99%