Video-Text Retrieval by Supervised Multi-Space Multi-Grained Alignment

Wang, Yimu; Shi, Peng

doi:10.48550/arxiv.2302.09473

Cited by 1 publication

(2 citation statements)

References 29 publications

“…Specifically, graph convolution will concentrate all the embeddings of similar nodes which might lead to the concentration of similarity (de la Pena and Montgomery-Smith, 1995) and the data degeneration problem (Baranwal et al, 2023). On the other side, a similar operation, average pooling, has been employed in computer vision (He et al, 2016; Wang and Shi, 2023). Average pooling will aggregate the features that are location-based similar.…”

Section: Mathematical Intuition (mentioning)

confidence: 99%

“…For example, in text-to-video retrieval, the objective is to rank gallery videos based on the features of the query text. Recently, inspired by the success in self-supervised learning (Radford et al, 2021), significant progress has been made in CMR, including image-text retrieval (Radford et al, 2021; Li et al, 2020; Wang et al, 2020a), video-text retrieval (Chen et al, 2020; Cheng et al, 2021; Gao et al, 2021; Lei et al, 2021; Ma et al, 2022; Park et al, 2022; Wang et al, 2022a,b; Zhao et al, 2022; Wang and Shi, 2023), and audio-text retrieval (Oncescu et al, 2021), with satisfactory retrieval performances.…”

Section: Introduction (mentioning)

confidence: 99%

InvGC: Robust Cross-Modal Retrieval by Inverse Graph Convolution

Jian, Wang (2023)

Findings of the Association for Computational Linguistics: EMNLP 2023

Over recent decades, significant advancements in cross-modal retrieval have been driven mainly by breakthroughs in visual and linguistic modeling. However, a recent study shows that multi-modal data representations tend to cluster within a limited convex cone (the representation degeneration problem), which hinders retrieval performance because the representations become inseparable. In our study, we first empirically validate the presence of the representation degeneration problem across multiple cross-modal benchmarks and methods. Next, to address it, we introduce a novel method called INVGC, a post-processing technique inspired by graph convolution and average pooling. Specifically, INVGC defines the graph topology within the datasets and then applies graph convolution in a subtractive manner. This method effectively separates representations by increasing the distances between data points. To improve the efficiency and effectiveness of INVGC, we propose an advanced graph topology, LOCALADJ, which aims only to increase the distances between each data point and its nearest neighbors. To understand why INVGC works, we present a detailed theoretical analysis, proving that the lower bound of recall will be improved after deploying INVGC. Extensive empirical results show that INVGC and INVGC w/LOCALADJ significantly mitigate the representation degeneration problem, thereby enhancing retrieval performance. Our code is available at link.
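
The idea of applying graph convolution "in a subtractive manner" can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity adjacency, the row normalization, the step size `r`, the neighbor count `k` (a loose analogue of restricting the graph to nearest neighbors), and all function names are assumptions made here for illustration only.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def subtractive_graph_conv(x, r=0.5, k=None):
    """Subtract a weighted average of neighbors from each embedding.

    Hypothetical sketch: the adjacency is built from cosine similarity
    (an assumption), optionally restricted to the k most similar rows.
    """
    xn = l2_normalize(np.asarray(x, dtype=float))
    sim = xn @ xn.T
    np.fill_diagonal(sim, -np.inf)  # never treat a point as its own neighbor

    if k is not None:
        # Keep only the k most similar neighbors in each row (requires k <= n - 1).
        kth = np.sort(sim, axis=1)[:, -k][:, None]
        mask = sim >= kth
    else:
        mask = np.isfinite(sim)

    # Non-negative edge weights, row-normalized so each row is a weighted average.
    adj = np.where(mask, np.clip(sim, 0.0, None), 0.0)
    adj = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1e-12)

    # Ordinary graph convolution would add adj @ xn; here it is subtracted.
    return l2_normalize(xn - r * (adj @ xn))


def mean_pairwise_cosine(x):
    """Average cosine similarity over all distinct pairs of rows."""
    xn = l2_normalize(x)
    s = xn @ xn.T
    n = len(xn)
    return (s.sum() - np.trace(s)) / (n * (n - 1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic embeddings squeezed into a narrow cone around one direction.
    x = rng.normal(size=16) + 0.1 * rng.normal(size=(50, 16))
    print("mean cosine before:", mean_pairwise_cosine(x))
    print("mean cosine after: ", mean_pairwise_cosine(subtractive_graph_conv(x)))
```

On synthetic embeddings that share a strong common direction, subtracting the neighbor average shrinks that shared component relative to the per-point variation, so the points spread apart in angle. Whether this carries over to real retrieval features depends on details the abstract does not give.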
