2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00495
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

Cited by 68 publications (44 citation statements)
References 16 publications
“…b) Pretrained-model-based video-text retrieval models: The pretrained-model-based video-text retrieval methods [6], [7], [21], [49] transfer the ability of the pretrained model to the cross-modal retrieval task by fine-tuning on the downstream datasets.…”
Section: Methods
confidence: 99%
“…• CLIP2Video (C2V) [7] presents a temporal difference block to capture motion across fine temporal video frames, and a temporal alignment block to re-align the tokens of video clips and phrases and improve multi-modal matching. • X-Pool [49] focuses on the information difference between video and text and proposes an X-Pool strategy whose main mechanism is a scaled dot-product attention that lets a text attend to its most semantically similar frames.…”
Section: Methods
confidence: 99%
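The mechanism described in the quote above can be made concrete with a short sketch. The snippet below is a minimal, single-head illustration of text-conditioned scaled dot-product attention pooling over frame embeddings, assuming CLIP-style text and frame features of equal dimension; the class name `TextConditionedPool`, the single linear projections, and the embedding size of 512 are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedPool(nn.Module):
    """Sketch: a text embedding attends over per-frame embeddings via
    scaled dot-product attention, yielding a text-conditioned video vector."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # projects the text query
        self.k_proj = nn.Linear(dim, dim)   # projects frame keys
        self.v_proj = nn.Linear(dim, dim)   # projects frame values
        self.scale = dim ** -0.5

    def forward(self, text_emb: torch.Tensor, frame_embs: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, D) text embeddings; frame_embs: (B, F, D) frame embeddings
        q = self.q_proj(text_emb).unsqueeze(1)                            # (B, 1, D)
        k = self.k_proj(frame_embs)                                       # (B, F, D)
        v = self.v_proj(frame_embs)                                       # (B, F, D)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, 1, F)
        pooled = (attn @ v).squeeze(1)                                    # (B, D)
        return F.normalize(pooled, dim=-1)

# Usage: score each text against its text-conditioned video embedding.
pool = TextConditionedPool(dim=512)
text = F.normalize(torch.randn(2, 512), dim=-1)        # hypothetical CLIP text features
frames = F.normalize(torch.randn(2, 12, 512), dim=-1)  # hypothetical per-frame features
video = pool(text, frames)                             # (2, 512)
sims = (text * video).sum(dim=-1)                      # per-pair cosine similarity
```

In this form the softmax weights let each query text emphasize its most semantically similar frames before the similarity is computed, which is the behavior the citation statement attributes to X-Pool.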
“…Inspired by this, recent works (Lei et al., 2021; Gao et al., 2021; Park et al., 2022; Cheng et al., 2021; Wang et al., 2022a,b; Zhao et al., 2022; Gorti et al., 2022) have attempted to pretrain or fine-tune video-text retrieval models in an end-to-end manner. CLIPBERT (Lei et al., 2021; Bain et al., 2021), as a pioneer, proposes to sparsely sample video clips for end-to-end training to obtain clip-level predictions and then summarize them.…”
Section: Related Work
confidence: 99%
“…To show the empirical efficiency of our SUMA, we train models on MSR-VTT (Xu et al., 2016), MSVD (Chen and Dolan, 2011), and ActivityNet (Fabian Caba Heilbron and Niebles, 2015). For a fair comparison, we only compare our methods with methods that are based on CLIP (Radford et al., 2021), i.e., Clip4Clip (Luo et al., 2022), CLIP2TV (Gao et al., 2021), X-CLIP, DiscreteCodebook (Liu et al., 2022a), TS2-Net (Liu et al., 2022b), CLIP2Video (Park et al., 2022), VCM, HiSE (Wang et al., 2022a), Align&Tell (Wang et al., 2022b), CenterCLIP (Zhao et al., 2022), and X-Pool (Gorti et al., 2022). Implementation details and evaluation protocols are deferred to the Appendix.…”
Section: Datasets and Baselines
confidence: 99%