2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00155
Fast Video Moment Retrieval

Cited by 61 publications (25 citation statements)
References 55 publications
“…Instead of using the simple Hadamard product, DMN [96] proposes to project proposal and query features into a common embedding space and to leverage metric learning for cross-modal pair discrimination. Moreover, FVMR [55] argues that the standard cross-modal interaction module is inefficient and replaces it with a semantic embedding module to model multimodal interaction.…”
Section: Temporal Adjacent Network
confidence: 99%
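The common-embedding-space idea described in the statement above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the feature dimensions and the random linear maps standing in for learned projections are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dims: proposal features are 512-d, query features are 300-d;
# random linear maps stand in for the learned projections into a
# shared 256-d embedding space.
W_v = rng.standard_normal((512, 256))
W_q = rng.standard_normal((300, 256))

proposals = rng.standard_normal((8, 512))  # 8 candidate moment features
queries = rng.standard_normal((8, 300))    # paired query features (i <-> i)

def l2norm(x):
    # Unit-normalize rows so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

v = l2norm(proposals @ W_v)
q = l2norm(queries @ W_q)

# Once both modalities live in one space, cross-modal matching reduces
# to a single matrix of dot products: diagonal entries are matched
# pairs, off-diagonal entries are mismatched pairs to be pushed apart
# by a metric-learning loss during training.
sim = v @ q.T  # shape (8, 8)
```

In training, a contrastive objective over `sim` would pull the diagonal (matched) scores up and the off-diagonal scores down, which is the pair-discrimination behavior the statement attributes to DMN.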
“…Another version of the anchor-based strategy is the 2D-Map strategy [18], [52]-[55]. Different from the standard anchor-based strategy above, the 2D-Map strategy is usually applied after the feature extractor, i.e., before the answer predictor.…”
Section: Proposal Generation
confidence: 99%
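The 2D-Map strategy mentioned above can be illustrated with a minimal sketch: entry (i, j) of a 2D map represents the candidate moment starting at clip i and ending at clip j, valid only when j ≥ i. The clip count, feature size, and mean-pooling used to represent a moment are assumptions for illustration.

```python
import numpy as np

num_clips = 4
clip_feats = np.random.default_rng(1).standard_normal((num_clips, 64))

# 2D temporal map: moment_map[i, j] holds the feature of the candidate
# moment spanning clips i..j; only the upper triangle (j >= i) is valid.
moment_map = np.zeros((num_clips, num_clips, 64))
valid = np.zeros((num_clips, num_clips), dtype=bool)
for i in range(num_clips):
    for j in range(i, num_clips):
        # Represent moment [i, j] by mean-pooling its clip features
        # (one simple choice; other pooling schemes are possible).
        moment_map[i, j] = clip_feats[i:j + 1].mean(axis=0)
        valid[i, j] = True
```

This is why the strategy sits after the feature extractor and before the answer predictor: the map enumerates all candidate moments at once, and the predictor then scores the valid entries.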
“…Fine-grained Query Feature. To obtain the fine-grained query feature q_u, an off-the-shelf toolkit [12,46] is used to parse the sentence into a semantic role tree. By applying a hierarchical attention mechanism to the tree, we can get the phrase-level features {g_k}, k = 1, …, N_verb.…”
Section: Gated
confidence: 99%
“…Recently, fast video temporal grounding (FVTG) [21] was proposed for accurate temporal localization and an efficient testing process. Note that the current VTG pipeline can be divided into three components: a video encoder, a text encoder, and a cross-modal interaction module.…”
confidence: 99%
“…Although it brings rich cross-modal interaction information, this module always consumes the majority of the test time due to complex feature-matrix interaction operations [2,9,10] or transformations [27]. Different from the above approaches, FVTG [21] calculates the similarity scores between video moments and texts in a common space, where efficient vector operations such as dot products between features of different modalities are conducted. As a result, common-space-based approaches can achieve a significant test speedup.…”
confidence: 99%
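The speed advantage described in the statement above can be sketched in a few lines. Under the common-space formulation, moment embeddings can be precomputed offline, so answering a query costs one encoder pass plus a single matrix-vector product, with no per-pair interaction module. The bank size and embedding dimension here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed setup: 10,000 moment embeddings (256-d) precomputed offline
# and unit-normalized, so test-time scoring is one dot product per moment.
moment_bank = rng.standard_normal((10000, 256))
moment_bank /= np.linalg.norm(moment_bank, axis=1, keepdims=True)

# At query time the text is encoded once into the same space.
query = rng.standard_normal(256)
query /= np.linalg.norm(query)

# Ranking all moments is a single matrix-vector product -- this is the
# "efficient vector operations" point, versus running a cross-modal
# interaction module once per (moment, query) pair.
scores = moment_bank @ query
best = int(np.argmax(scores))  # index of the top-ranked moment
```

The contrast with interaction-based pipelines is that there, every candidate pair must pass through the (expensive) interaction module at test time, so the cost scales with the number of pairs rather than with one cheap matrix product.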