Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3414053

Fine-grained Iterative Attention Network for Temporal Language Localization in Videos

Abstract: Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query. To tackle this task, designing an effective model to extract grounding information from both visual and textual modalities is crucial. However, most previous attempts in this field only focus on unidirectional interactions from video to query, which emphasize which words to listen to and attend to sentence information via vanilla soft attention, but clues from query-by-video interacti…
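As a rough illustration of the unidirectional interaction the abstract contrasts against, the sketch below implements vanilla video-to-query soft attention, where each video clip decides which query words to listen to. All names, shapes, and the fusion scheme are assumptions made for illustration, not the paper's actual architecture.

import torch
import torch.nn.functional as F

def video_to_query_attention(video_feats, query_feats):
    """video_feats: (T, d) clip features; query_feats: (L, d) word features.
    Each clip attends over the words and pools a sentence summary."""
    scores = video_feats @ query_feats.T                      # (T, L) clip-word affinities
    weights = F.softmax(scores, dim=-1)                       # which words each clip listens to
    attended_query = weights @ query_feats                    # (T, d) per-clip sentence context
    return torch.cat([video_feats, attended_query], dim=-1)   # (T, 2d) fused representation

# Toy usage with random features: 128 clips, 12 query words, 256-d features.
fused = video_to_query_attention(torch.randn(128, 256), torch.randn(12, 256))
print(fused.shape)  # torch.Size([128, 512])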

Cited by 74 publications (48 citation statements); References 45 publications.
“…This section will compare our method with several state-of-the-art methods. Since our model belongs to the one-stage methods, we mainly compare it with one-stage methods, which are ABLR [32], ExCL [9], DEBUG [18], TMLGA [24], HVTG [5], VSLnet [34], GDP [4], DRN [33], FIAN [22] and VLG-Net [21]. To further illustrate the effect, we also give the scores of some two-stage methods, including CTRL [8], SLTA [12], ACRN [16], CBP [26] and 2D-TAN [35].…”
Section: Comparison With State-of-the-art Methods (mentioning)
confidence: 99%
“…The following works [12,15,22] mainly focus on constructing a better interaction model between candidates and the query sentence. Jiang et al. take advantage of object-level features to mine specific details in videos.…”
Section: Temporal Moment Localization (mentioning)
confidence: 99%
“…Zhang et al. [9] first explored the fine-grained semantic information in both videos and sentences and then captured the multi-stage cross-modal interactions. Qu et al. [28] proposed the iterative attention module to excavate the grounding clues from both visual and textual modalities. Liu et al. [29] reformulated this work as an iterative message-passing process over a joint graph that consists of the cross-modal and self-modal relation graphs.…”
Section: A. Temporal Sentence Grounding (mentioning)
confidence: 99%
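To make the graph reformulation cited above a little more concrete, here is a minimal sketch of iterative message passing over a joint clip-word graph. The adjacency structure, normalisation, and shapes are assumptions for illustration only; they are not the cited model's actual cross-modal and self-modal relation graphs.

import torch
import torch.nn.functional as F

def message_passing_step(x, adj, W):
    """One round: every node aggregates neighbour features through a shared linear map."""
    deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)   # simple degree normalisation
    messages = ((adj @ x) / deg) @ W                     # aggregated, transformed neighbour features
    return F.relu(messages + x)                          # residual update of node states

# Joint graph over T clip nodes and L word nodes; here every node is connected to
# every other node as a placeholder for the self-modal and cross-modal edges.
T, L, d = 128, 12, 256
x = torch.randn(T + L, d)            # stacked clip and word node features
adj = torch.ones(T + L, T + L)       # placeholder joint adjacency matrix
W = torch.randn(d, d) * 0.01         # shared message transformation
for _ in range(3):                   # iterative message passing
    x = message_passing_step(x, adj, W)
video_nodes, word_nodes = x[:T], x[T:]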
“…• FIAN [28]: The FIAN method proposes the iterative attention module, where the visual and textual features reinforce each other to generate a robust sentence-aware video representation. Tables 1 and 2 report the quantitative performance comparison results on the ActivityNet Captions and TACoS datasets, respectively.…”
Section: Performance Comparisons (mentioning)
confidence: 99%
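The iterative attention module that FIAN is credited with above can be pictured as a few rounds of alternating cross-attention in which each modality refines the other. The sketch below is only an interpretation under assumed shapes and plain dot-product attention; it is not FIAN's actual formulation.

import torch
import torch.nn.functional as F

def cross_attend(queries, keys_values):
    """Dot-product attention: each query row gathers a weighted summary of keys_values."""
    weights = F.softmax(queries @ keys_values.T, dim=-1)   # (N, M) attention weights
    return queries + weights @ keys_values                 # residual refinement

def iterative_attention(video_feats, query_feats, num_iters=3):
    v, q = video_feats, query_feats
    for _ in range(num_iters):
        q = cross_attend(q, v)   # words gather supporting visual evidence
        v = cross_attend(v, q)   # clips gather sentence-aware context
    return v, q                  # sentence-aware video and video-aware query features

# Toy usage: 128 clips and 12 query words, both 256-d.
v, q = iterative_attention(torch.randn(128, 256), torch.randn(12, 256))
print(v.shape, q.shape)  # torch.Size([128, 256]) torch.Size([12, 256])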