Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.215

Cross-Modal Discrete Representation Learning

Abstract: In contrast to recent advances focusing on high-level representation learning across modalities, in this work we present a self-supervised learning framework that is able to learn a representation that captures finer levels of granularity across different modalities, such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities. Beyond the shared embedding space, we propose…
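The abstract describes a discretized embedding space built with vector quantization and shared across modalities. Below is a minimal sketch, not the authors' code, of how such a shared codebook could look in PyTorch: features from any modality encoder are snapped to their nearest code, with a straight-through estimator and a VQ-VAE-style commitment loss. Names such as `SharedCodebook`, `num_codes`, and `beta` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedCodebook(nn.Module):
    """One codebook shared by all modality encoders (sketch, assumed design)."""

    def __init__(self, num_codes: int = 512, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment loss weight

    def forward(self, z: torch.Tensor):
        # z: (batch, dim) continuous features from a modality encoder
        dist = torch.cdist(z, self.codebook.weight)   # (batch, num_codes)
        indices = dist.argmin(dim=-1)                 # nearest code per feature
        z_q = self.codebook(indices)                  # quantized vectors
        # Pull codes toward encoder outputs and encoder outputs toward codes.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                  # straight-through estimator
        return z_q, indices, loss

# The same codebook instance receives features from both modalities, so both
# index the same discrete units (e.g., visual objects and spoken words).
codebook = SharedCodebook()
visual_feat = torch.randn(8, 256)
speech_feat = torch.randn(8, 256)
v_q, v_idx, v_loss = codebook(visual_feat)
s_q, s_idx, s_loss = codebook(speech_feat)
```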

Cited by 19 publications (18 citation statements)
References 21 publications
“…Further, TS2-Net (Liu et al., 2022b) proposes a novel token shift and selection transformer architecture that adjusts the token sequence and selects informative tokens in both the temporal and spatial dimensions of the input video. Later, DiscreteCodebook (Liu et al., 2022a) proposes to align modalities in a space filled with concepts, which are randomly initialized and updated without supervision, while VCM proposes to construct a space of visual concepts clustered without supervision.…”
Section: Related Work
confidence: 99%
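The statement above summarizes DiscreteCodebook as aligning modalities in a space of randomly initialized, unsupervisedly updated concepts. The following is a rough sketch of one way such concept-space alignment could be set up, not necessarily the cited papers' exact formulation: each modality's features are soft-assigned to a shared concept bank, and paired video/text inputs are encouraged to induce similar assignment distributions. `ConceptAlignment`, `num_concepts`, and `temperature` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptAlignment(nn.Module):
    """Align two modalities through a shared, randomly initialized concept bank (sketch)."""

    def __init__(self, num_concepts: int = 256, dim: int = 512, temperature: float = 0.07):
        super().__init__()
        self.concepts = nn.Parameter(torch.randn(num_concepts, dim))
        self.temperature = temperature

    def assign(self, feats: torch.Tensor) -> torch.Tensor:
        # Soft-assign each feature to the concept bank via cosine similarity.
        feats = F.normalize(feats, dim=-1)
        concepts = F.normalize(self.concepts, dim=-1)
        return F.softmax(feats @ concepts.t() / self.temperature, dim=-1)

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # Paired video/text features should induce similar concept distributions.
        p_video = self.assign(video_feats)
        p_text = self.assign(text_feats)
        # Symmetric KL divergence between the two assignments as an alignment loss;
        # the concept bank is updated only by this objective (no concept labels).
        return 0.5 * (F.kl_div(p_video.log(), p_text, reduction="batchmean")
                      + F.kl_div(p_text.log(), p_video, reduction="batchmean"))
```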
“…In this part, we present a general video-text retrieval framework used by previous methods (Luo et al., 2022; Liu et al., 2022a). With this paradigm, we can obtain three representations for the different modalities from the original space, i.e., the frame representation r_f, the video representation r_v, and the sentence representation r_s, produced by modality-dependent encoders.…”
Section: General Video-Text Retrieval Paradigm
confidence: 99%
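The statement above outlines the general video-text retrieval paradigm: modality-dependent encoders yield frame representations r_f, a video representation r_v, and a sentence representation r_s, which are then compared for retrieval. Below is a minimal sketch of that paradigm with placeholder encoders; the linear layers and mean pooling are stand-ins, not the cited methods' actual architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextRetrieval(nn.Module):
    """General video-text retrieval paradigm (sketch with placeholder encoders)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.frame_encoder = nn.Linear(2048, dim)   # stand-in for a visual backbone
        self.text_encoder = nn.Linear(768, dim)     # stand-in for a sentence encoder

    def forward(self, frames: torch.Tensor, sentence: torch.Tensor):
        # frames:   (batch, num_frames, 2048) raw frame features
        # sentence: (batch, 768) raw sentence features
        r_f = self.frame_encoder(frames)            # frame representations r_f
        r_v = r_f.mean(dim=1)                       # video representation r_v (mean pooling)
        r_s = self.text_encoder(sentence)           # sentence representation r_s
        return r_f, r_v, r_s

    @staticmethod
    def similarity(r_v: torch.Tensor, r_s: torch.Tensor) -> torch.Tensor:
        # (num_videos, num_sentences) cosine-similarity matrix used for ranking.
        return F.normalize(r_v, dim=-1) @ F.normalize(r_s, dim=-1).t()

# Usage: rank sentences for each video (and vice versa) by the similarity matrix.
model = VideoTextRetrieval()
r_f, r_v, r_s = model(torch.randn(4, 12, 2048), torch.randn(4, 768))
scores = model.similarity(r_v, r_s)
```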