Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3414053

Fine-grained Iterative Attention Network for Temporal Language Localization in Videos

Abstract: Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query. To tackle this task, designing an effective model to extract grounding information from both visual and textual modalities is crucial. However, most previous attempts in this field only focus on unidirectional interactions from video to query, which emphasize which words to listen to and attend to sentence information via vanilla soft attention, but clues from query-by-video interacti…
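As a rough illustration of the unidirectional interaction the abstract contrasts against, the sketch below implements vanilla video-to-query soft attention, where each video clip decides which query words to listen to. All names, shapes, and the fusion scheme are assumptions made for illustration, not the paper's actual architecture.

import torch
import torch.nn.functional as F

def video_to_query_attention(video_feats, query_feats):
    """video_feats: (T, d) clip features; query_feats: (L, d) word features.
    Each clip attends over the words and pools a sentence summary."""
    scores = video_feats @ query_feats.T                      # (T, L) clip-word affinities
    weights = F.softmax(scores, dim=-1)                       # which words each clip listens to
    attended_query = weights @ query_feats                    # (T, d) per-clip sentence context
    return torch.cat([video_feats, attended_query], dim=-1)   # (T, 2d) fused representation

# Toy usage with random features: 128 clips, 12 query words, 256-d features.
fused = video_to_query_attention(torch.randn(128, 256), torch.randn(12, 256))
print(fused.shape)  # torch.Size([128, 512])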

Cited by 74 publications (48 citation statements); References 45 publications.
“…This section will compare our method with several state-of-the-art methods. Since our model belongs to the one-stage methods, we mainly compare it with one-stage methods, which are ABLR [32], ExCL [9], DEBUG [18], TMLGA [24], HVTG [5], VSLnet [34], GDP [4], DRN [33], FIAN [22] and VLG-Net [21]. To further illustrate the effect, we also give the scores of some two-stage methods, including CTRL [8], SLTA [12], ACRN [16], CBP [26] and 2D-TAN [35].…”
Section: Comparison With State-of-the-art Methods (mentioning)
confidence: 99%
“…The following works [12,15,22] mainly focus on constructing a better interaction model between candidates and the query sentence. Jiang et al. take advantage of object-level features to mine specific details in videos.…”
Section: Temporal Moment Localization (mentioning)
confidence: 99%
“…Zhang et al. [9] first explored the fine-grained semantic information in both videos and sentences and then captured the multi-stage cross-modal interactions. Qu et al. [28] proposed the iterative attention module to excavate the grounding clues from both visual and textual modalities. Liu et al. [29] reformulated this work as an iterative message-passing process over a joint graph that consists of the cross-modal and self-modal relation graphs.…”
Section: A. Temporal Sentence Grounding (mentioning)
confidence: 99%
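To make the graph reformulation cited above a little more concrete, here is a minimal sketch of iterative message passing over a joint clip-word graph. The adjacency structure, normalisation, and shapes are assumptions for illustration only; they are not the cited model's actual cross-modal and self-modal relation graphs.

import torch
import torch.nn.functional as F

def message_passing_step(x, adj, W):
    """One round: every node aggregates neighbour features through a shared linear map."""
    deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)   # simple degree normalisation
    messages = ((adj @ x) / deg) @ W                     # aggregated, transformed neighbour features
    return F.relu(messages + x)                          # residual update of node states

# Joint graph over T clip nodes and L word nodes; here every node is connected to
# every other node as a placeholder for the self-modal and cross-modal edges.
T, L, d = 128, 12, 256
x = torch.randn(T + L, d)            # stacked clip and word node features
adj = torch.ones(T + L, T + L)       # placeholder joint adjacency matrix
W = torch.randn(d, d) * 0.01         # shared message transformation
for _ in range(3):                   # iterative message passing
    x = message_passing_step(x, adj, W)
video_nodes, word_nodes = x[:T], x[T:]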
“…• FIAN [28]: The FIAN method proposes the iterative attention module, where the visual and textual features reinforce each other to generate a robust sentence-aware video representation. Tables 1 and 2 report the quantitative performance comparison results on the ActivityNet Captions and TACoS datasets, respectively.…”
Section: Performance Comparisons (mentioning)
confidence: 99%
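The iterative attention module that FIAN is credited with above can be pictured as a few rounds of alternating cross-attention in which each modality refines the other. The sketch below is only an interpretation under assumed shapes and plain dot-product attention; it is not FIAN's actual formulation.

import torch
import torch.nn.functional as F

def cross_attend(queries, keys_values):
    """Dot-product attention: each query row gathers a weighted summary of keys_values."""
    weights = F.softmax(queries @ keys_values.T, dim=-1)   # (N, M) attention weights
    return queries + weights @ keys_values                 # residual refinement

def iterative_attention(video_feats, query_feats, num_iters=3):
    v, q = video_feats, query_feats
    for _ in range(num_iters):
        q = cross_attend(q, v)   # words gather supporting visual evidence
        v = cross_attend(v, q)   # clips gather sentence-aware context
    return v, q                  # sentence-aware video and video-aware query features

# Toy usage: 128 clips and 12 query words, both 256-d.
v, q = iterative_attention(torch.randn(128, 256), torch.randn(12, 256))
print(v.shape, q.shape)  # torch.Size([128, 256]) torch.Size([12, 256])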