2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00495
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

Cited by 68 publications (44 citation statements)
References 16 publications
“…b) Pretrained-model-based video-text retrieval models: The pretrained-model-based video-text retrieval methods [6], [7], [21], [49] transfer the ability of the pretrained model to the cross-modal retrieval task by fine-tuning on the downstream datasets.…”
Section: Methods
confidence: 99%
“…• CLIP2Video (C2V) [7] presents a temporal difference block to capture motion across fine temporal video frames, and a temporal alignment block to re-align the tokens of video clips and phrases and improve multi-modal matching. • X-Pool [49] focuses on the information difference between video and text and proposes an X-Pool strategy whose main mechanism is a scaled dot-product attention that lets a text attend to its most semantically similar frames.…”
Section: Methods
confidence: 99%
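The mechanism described in the quote above can be made concrete with a short sketch. The snippet below is a minimal, single-head illustration of text-conditioned scaled dot-product attention pooling over frame embeddings, assuming CLIP-style text and frame features of equal dimension; the class name `TextConditionedPool`, the single linear projections, and the embedding size of 512 are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedPool(nn.Module):
    """Sketch: a text embedding attends over per-frame embeddings via
    scaled dot-product attention, yielding a text-conditioned video vector."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # projects the text query
        self.k_proj = nn.Linear(dim, dim)   # projects frame keys
        self.v_proj = nn.Linear(dim, dim)   # projects frame values
        self.scale = dim ** -0.5

    def forward(self, text_emb: torch.Tensor, frame_embs: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, D) text embeddings; frame_embs: (B, F, D) frame embeddings
        q = self.q_proj(text_emb).unsqueeze(1)                            # (B, 1, D)
        k = self.k_proj(frame_embs)                                       # (B, F, D)
        v = self.v_proj(frame_embs)                                       # (B, F, D)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, 1, F)
        pooled = (attn @ v).squeeze(1)                                    # (B, D)
        return F.normalize(pooled, dim=-1)

# Usage: score each text against its text-conditioned video embedding.
pool = TextConditionedPool(dim=512)
text = F.normalize(torch.randn(2, 512), dim=-1)        # hypothetical CLIP text features
frames = F.normalize(torch.randn(2, 12, 512), dim=-1)  # hypothetical per-frame features
video = pool(text, frames)                             # (2, 512)
sims = (text * video).sum(dim=-1)                      # per-pair cosine similarity
```

In this form the softmax weights let each query text emphasize its most semantically similar frames before the similarity is computed, which is the behavior the citation statement attributes to X-Pool.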
“…Inspired by this, recent works (Lei et al., 2021; Gao et al., 2021; Park et al., 2022; Cheng et al., 2021; Wang et al., 2022a,b; Zhao et al., 2022; Gorti et al., 2022) have attempted to pretrain or fine-tune video-text retrieval models in an end-to-end manner. CLIPBERT (Lei et al., 2021; Bain et al., 2021), as a pioneer, proposes to sparsely sample video clips for end-to-end training to obtain clip-level predictions and then summarize them.…”
Section: Related Work
confidence: 99%
“…To show the empirical efficiency of our SUMA, we train models on MSR-VTT (Xu et al., 2016), MSVD (Chen and Dolan, 2011), and ActivityNet (Fabian Caba Heilbron and Niebles, 2015). For a fair comparison, we only compare our methods with methods that are based on CLIP (Radford et al., 2021), i.e., Clip4Clip (Luo et al., 2022), CLIP2TV (Gao et al., 2021), X-CLIP, DiscreteCodebook (Liu et al., 2022a), TS2-Net (Liu et al., 2022b), CLIP2Video (Park et al., 2022), VCM, HiSE (Wang et al., 2022a), Align&Tell (Wang et al., 2022b), CenterCLIP (Zhao et al., 2022), and X-Pool (Gorti et al., 2022). Implementation details and evaluation protocols are deferred to the Appendix.…”
Section: Datasets and Baselines
confidence: 99%