2022
DOI: 10.1016/j.neucom.2021.11.019

STCM-Net: A symmetrical one-stage network for temporal language localization in videos

Cited by 7 publications (2 citation statements)
References 7 publications
“…In this sense, multimodal interaction is overlooked. To remedy, PLN [89], SMIN [53], CLEAR [90], and STCM-Net [91] disentangle video proposals into different temporal granularities [89], [91] or different semantic contents [53], [90], and perform cross-modal reasoning at both coarse- and fine-grained granularities. VLG-Net [92] and RaNet [54] maintain query words and video proposals in a graph, and adopt GCN [4], [93] to conduct intra- and inter-modal interactions for cross-modal reasoning.…”
Section: Temporal Adjacent Network (mentioning)
confidence: 99%
“…In this sense, multimodal interaction is overlooked. To remedy, PLN [79], SMIN [57], CLEAR [80], and STCM-Net [81] disentangle video proposals into different temporal granularities [79,81] or different semantic contents [57,80], and perform cross-modal reasoning at both coarse- and fine-grained granularities. VLG-Net [82] and RaNet [58] maintain query words and video proposals in a graph, and adopt GCN [83,84] to conduct intra- and inter-modal interactions for cross-modal reasoning.…”
Section: 2D-Map (mentioning)
confidence: 99%
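
Both statements describe the same design pattern: video proposals are disentangled into coarse and fine temporal granularities, and cross-modal reasoning against the query is performed at both levels. Below is a minimal PyTorch sketch of that idea; every module name, tensor shape, and the pooling/fusion choices are illustrative assumptions, not the published implementations of STCM-Net, PLN, SMIN, CLEAR, VLG-Net, or RaNet.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularityCrossModal(nn.Module):
    """Toy sketch: cross-modal reasoning between a sentence query and
    video clip features at a fine and a coarse temporal granularity."""

    def __init__(self, dim=256):
        super().__init__()
        # One cross-attention block per granularity (assumed design).
        self.fine_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.coarse_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, clip_feats, query_feats):
        # clip_feats:  (B, T, D) per-clip video features (fine granularity)
        # query_feats: (B, L, D) word-level query features
        # Coarse granularity: average-pool pairs of adjacent clips.
        coarse = F.avg_pool1d(clip_feats.transpose(1, 2), kernel_size=2).transpose(1, 2)
        # Cross-modal attention: video features attend to query words.
        fine, _ = self.fine_attn(clip_feats, query_feats, query_feats)    # (B, T, D)
        coarse, _ = self.coarse_attn(coarse, query_feats, query_feats)    # (B, T//2, D)
        # Upsample the coarse stream back to T steps and fuse per clip.
        coarse_up = F.interpolate(coarse.transpose(1, 2), size=fine.size(1)).transpose(1, 2)
        fused = torch.cat([fine, coarse_up], dim=-1)                      # (B, T, 2D)
        return self.score(fused).squeeze(-1)                              # (B, T) query-relevance per clip

# Toy usage: 2 videos, 16 clips, 8 query words, 256-d features.
model = MultiGranularityCrossModal(dim=256)
scores = model(torch.randn(2, 16, 256), torch.randn(2, 8, 256))
print(scores.shape)  # torch.Size([2, 16])

The toy call runs as written and prints torch.Size([2, 16]); the cited systems replace the average pooling and attention here with their own granularity decompositions (or, in VLG-Net and RaNet, with GCN layers over a joint word–proposal graph).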