2021
DOI: 10.1109/tip.2021.3113791

Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos

Cited by 30 publications (12 citation statements)
References: 51 publications

“…Subsequent work generally follows the strategies of TGN or SCDM with more sophisticated learning modules and/or auxiliary objectives. To be specific, CMIN [50], [78], CBP [79], FIAN [80], HDRR [81], and MIGCN [82] adopt the strategy of TGN, while CSMGAN [83], RMN [84], IA-Net [85], and DCT-Net [86] apply the strategy of SCDM. These solutions design various cross-modal reasoning strategies to perform more fine-grained and deeper multi-modal interaction between video and query, for precise moment localization.…”
Section: Anchor-based Methods (mentioning, confidence: 99%)
“…Some other works adopt a boundary regression module to refine the start and end time points of generated moments. MIGCN [82] develops a rank module, apart from the boundary regression module, to distinguish the optimal proposal from a set of similar proposal candidates. 2D-Map Anchor-based Method.…”
Section: Anchor-based Methods (mentioning, confidence: 99%)
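A minimal sketch of the rank-plus-boundary-regression idea mentioned in the statement above, written in PyTorch. The class name RankAndRefine, the head names, and all dimensions are assumptions for illustration and do not reproduce MIGCN's actual implementation; the sketch only shows how a scoring head can pick the best proposal among similar candidates while a regression head refines its start/end boundaries.

import torch
import torch.nn as nn

class RankAndRefine(nn.Module):
    # Illustrative only: one head scores each proposal, another regresses
    # (start, end) offsets that refine the coarse proposal boundaries.
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.rank_head = nn.Linear(feat_dim, 1)       # matching score per proposal
        self.boundary_head = nn.Linear(feat_dim, 2)   # (start offset, end offset)

    def forward(self, proposal_feats, proposal_bounds):
        # proposal_feats:  (N, feat_dim) fused video-query features, one per proposal
        # proposal_bounds: (N, 2) coarse (start, end) times of the proposals
        scores = self.rank_head(proposal_feats).squeeze(-1)   # (N,)
        offsets = self.boundary_head(proposal_feats)          # (N, 2)
        refined = proposal_bounds + offsets                   # regressed boundaries
        best = scores.argmax()                                # highest-ranked candidate
        return refined[best], scores

model = RankAndRefine(feat_dim=256)
feats = torch.randn(8, 256)                                   # 8 similar candidates
bounds = torch.tensor([[2.0 * i, 2.0 * i + 4.0] for i in range(8)])
best_span, scores = model(feats, bounds)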
“…TGN [2] temporally captures the evolving fine-grained frame-by-word interactions and uses pre-set anchors to produce multi-scale proposal candidates ending at each time step. Subsequently, [15, 21, 33, 34] follow the anchor-based framework and propose various multi-modal reasoning strategies to achieve precise moment localization. In addition, 2D-TAN [32] enumerates all possible segments as proposal candidates and converts them into a 2D feature map; a temporal adjacent network is then proposed to obtain multi-modal representations and encode the video context information.…”
Section: Short-form Video Temporal Grounding (mentioning, confidence: 99%)
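The anchor enumeration that TGN-style methods rely on, as described in the statement above, can be sketched in a few lines of Python. The anchor widths below are arbitrary example values, not those used by any cited method: at each time step t, every pre-set width yields a candidate moment ending at t, producing the multi-scale proposals that are then scored against the query.

def enumerate_anchor_proposals(num_steps, anchor_widths=(4, 8, 16, 32)):
    # For every time step t, emit one candidate (start, end) per anchor width,
    # all ending at t; candidates that would start before the video are dropped.
    proposals = []
    for t in range(num_steps):
        for width in anchor_widths:
            start = t - width + 1
            if start >= 0:
                proposals.append((start, t))
    return proposals

# A clip with 128 feature steps yields up to 4 candidates per step.
print(len(enumerate_anchor_proposals(128)))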
“…HDRR [71], and MIGCN [72] adopt the strategy of TGN, while CSMGAN [73], RMN [74], IA-Net [75], and DCT-Net [76] apply the strategy of SCDM. These solutions design various crossmodal reasoning strategies to perform more fine-grained and deeper multi-modal interaction between video and query, for precise moment localization.…”
Section: Temporal Adjacent Network (mentioning, confidence: 99%)
“…Some other works adopt a boundary regression module to refine the start and end timestamps of generated moments. MIGCN [72] develops a rank module, apart from the boundary regression module, to distinguish the optimal proposal from a set of similar proposal candidates. Before 2D-Map methods, a prior work, TMN [77], first proposes to enumerate all possible consecutive segments as proposals and predict the best-matched proposal as the result by interacting each proposal with the query.…”
Section: Temporal Adjacent Network (mentioning, confidence: 99%)
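The exhaustive matching that TMN is credited with in the statement above can also be sketched directly. The mean-pooling and cosine-similarity scoring here are stand-ins chosen for brevity, not TMN's actual interaction module: every consecutive segment (i, j) becomes a proposal, each proposal is scored against the query, and the best-matched segment is returned.

import torch
import torch.nn.functional as F

def best_matching_segment(clip_feats, query_feat):
    # clip_feats: (T, d) per-step video features; query_feat: (d,) sentence feature.
    # Enumerates all consecutive segments and keeps the highest-scoring one.
    num_steps = clip_feats.size(0)
    best_score, best_span = float("-inf"), (0, 0)
    for i in range(num_steps):
        for j in range(i, num_steps):
            proposal = clip_feats[i:j + 1].mean(dim=0)               # pooled segment feature
            score = F.cosine_similarity(proposal, query_feat, dim=0)
            if score.item() > best_score:
                best_score, best_span = score.item(), (i, j)
    return best_span, best_score

span, score = best_matching_segment(torch.randn(32, 64), torch.randn(64))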