2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00155
Fast Video Moment Retrieval

Cited by 61 publications (25 citation statements)
References 55 publications
“…Instead of using the simple Hadamard product, DMN [96] proposes to project proposal and query features into a common embedding space and to leverage metric learning for cross-modal pair discrimination. Moreover, FVMR [55] argues that the standard cross-modal interaction module is inefficient and replaces it with a semantic embedding module to model multimodal interaction.…”
Section: Temporal Adjacent Network
confidence: 99%
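The common-embedding-space idea described in the statement above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the feature dimensions and the random linear maps standing in for learned projections are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dims: proposal features are 512-d, query features are 300-d;
# random linear maps stand in for the learned projections into a
# shared 256-d embedding space.
W_v = rng.standard_normal((512, 256))
W_q = rng.standard_normal((300, 256))

proposals = rng.standard_normal((8, 512))  # 8 candidate moment features
queries = rng.standard_normal((8, 300))    # paired query features (i <-> i)

def l2norm(x):
    # Unit-normalize rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

v = l2norm(proposals @ W_v)
q = l2norm(queries @ W_q)

# Once both modalities live in one space, cross-modal matching reduces
# to a single matrix of dot products: diagonal entries are matched
# pairs, off-diagonal entries are mismatched pairs to be pushed apart
# by a metric-learning loss during training.
sim = v @ q.T  # shape (8, 8)
```

In training, a contrastive objective over `sim` would pull the diagonal (matched) scores up and the off-diagonal scores down, which is the pair-discrimination behavior the statement attributes to DMN.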
“…Another version of the anchor-based strategy is the 2D-Map strategy [18], [52]-[55]. Different from the standard anchor-based strategy above, the 2D-Map strategy is usually applied after the feature extractor, i.e., before the answer predictor.…”
Section: Proposal Generation
confidence: 99%
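The 2D-Map strategy mentioned above can be illustrated with a minimal sketch: entry (i, j) of a 2D map represents the candidate moment starting at clip i and ending at clip j, valid only when j ≥ i. The clip count, feature size, and mean-pooling used to represent a moment are assumptions for illustration.

```python
import numpy as np

num_clips = 4
clip_feats = np.random.default_rng(1).standard_normal((num_clips, 64))

# 2D temporal map: moment_map[i, j] holds the feature of the candidate
# moment spanning clips i..j; only the upper triangle (j >= i) is valid.
moment_map = np.zeros((num_clips, num_clips, 64))
valid = np.zeros((num_clips, num_clips), dtype=bool)
for i in range(num_clips):
    for j in range(i, num_clips):
        # Represent moment [i, j] by mean-pooling its clip features
        # (one simple choice; other pooling schemes are possible).
        moment_map[i, j] = clip_feats[i:j + 1].mean(axis=0)
        valid[i, j] = True
```

This is why the strategy sits after the feature extractor and before the answer predictor: the map enumerates all candidate moments at once, and the predictor then scores the valid entries.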
“…Fine-grained Query Feature. To obtain the fine-grained query feature q_u, an off-the-shelf toolkit [12,46] is used to parse the sentence into a semantic role tree. By applying a hierarchical attention mechanism to the tree, we can get the phrase-level features {g_k}, k = 1, …, N_verb.…”
Section: Gated
confidence: 99%
“…Recently, fast video temporal grounding (FVTG) [21] was proposed for accurate temporal localization and an efficient testing process. Note that the current VTG pipeline can be divided into three components: a video encoder, a text encoder, and a cross-modal interaction module.…”
confidence: 99%
“…Although it brings rich cross-modal interaction information, this module always consumes the majority of the test time due to complex feature-matrix interaction operations [2,9,10] or transformations [27]. Different from the above approaches, FVTG [21] calculates the similarity scores between video moments and texts in a common space, where efficient vector operations such as dot products between features of different modalities are conducted. As a result, common-space-based approaches can achieve a significant test speedup.…”
confidence: 99%
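The speed advantage described in the statement above can be sketched in a few lines. Under the common-space formulation, moment embeddings can be precomputed offline, so answering a query costs one encoder pass plus a single matrix-vector product, with no per-pair interaction module. The bank size and embedding dimension here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed setup: 10,000 moment embeddings (256-d) precomputed offline
# and unit-normalized, so test-time scoring is one dot product per moment.
moment_bank = rng.standard_normal((10000, 256))
moment_bank /= np.linalg.norm(moment_bank, axis=1, keepdims=True)

# At query time the text is encoded once into the same space.
query = rng.standard_normal(256)
query /= np.linalg.norm(query)

# Ranking all moments is a single matrix-vector product -- this is the
# "efficient vector operations" point, versus running a cross-modal
# interaction module once per (moment, query) pair.
scores = moment_bank @ query
best = int(np.argmax(scores))  # index of the top-ranked moment
```

The contrast with interaction-based pipelines is that there, every candidate pair must pass through the (expensive) interaction module at test time, so the cost scales with the number of pairs rather than with one cheap matrix product.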