2023
DOI: 10.48550/arxiv.2302.09473
Preprint

Video-Text Retrieval by Supervised Multi-Space Multi-Grained Alignment

Abstract: While recent progress in video-text retrieval has been advanced by the exploration of better representation learning, in this paper, we present a novel multi-space multi-grained supervised learning framework, SUMA, to learn an aligned representation space shared between the video and the text for video-text retrieval. The shared aligned space is initialized with a finite number of concept clusters, each of which refers to a number of basic concepts (words). With the text data at hand, we are able to update the…

Cited by 1 publication (2 citation statements)
References 29 publications

“…Specifically, graph convolution will concentrate all the embeddings of similar nodes which might lead to the concentration of similarity (de la Pena and Montgomery-Smith, 1995) and the data degeneration problem (Baranwal et al, 2023). On the other side, a similar operation, average pooling, has been employed in computer vision (He et al, 2016; Wang and Shi, 2023). Average pooling will aggregate the features that are location-based similar 3 .…”
Section: Mathematical Intuition (mentioning)
confidence: 99%
“…For example, in text-to-video retrieval, the objective is to rank gallery videos based on the features of the query text. Recently, inspired by the success in self-supervised learning (Radford et al, 2021), significant progress has been made in CMR, including image-text retrieval (Radford et al, 2021; Li et al, 2020; Wang et al, 2020a), video-text retrieval (Chen et al, 2020; Cheng et al, 2021; Gao et al, 2021; Lei et al, 2021; Ma et al, 2022; Park et al, 2022; Wang et al, 2022a,b; Zhao et al, 2022; Wang and Shi, 2023), and audio-text retrieval (Oncescu et al, 2021), with satisfactory retrieval performances.…”
Section: Introduction (mentioning)
confidence: 99%