Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.215

Cross-Modal Discrete Representation Learning

Abstract: In contrast to recent advances focusing on high-level representation learning across modalities, in this work we present a self-supervised learning framework that is able to learn a representation that captures finer levels of granularity across different modalities, such as concepts or events represented by visual objects or spoken words. Our framework relies on a discretized embedding space created via vector quantization that is shared across different modalities. Beyond the shared embedding space, we propose…
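The abstract describes a discretized embedding space built with vector quantization and shared across modalities. Below is a minimal sketch, not the authors' code, of how such a shared codebook could look in PyTorch: features from any modality encoder are snapped to their nearest code, with a straight-through estimator and a VQ-VAE-style commitment loss. Names such as `SharedCodebook`, `num_codes`, and `beta` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedCodebook(nn.Module):
    """One codebook shared by all modality encoders (sketch, assumed design)."""

    def __init__(self, num_codes: int = 512, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment loss weight

    def forward(self, z: torch.Tensor):
        # z: (batch, dim) continuous features from a modality encoder
        dist = torch.cdist(z, self.codebook.weight)   # (batch, num_codes)
        indices = dist.argmin(dim=-1)                 # nearest code per feature
        z_q = self.codebook(indices)                  # quantized vectors
        # Pull codes toward encoder outputs and encoder outputs toward codes.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                  # straight-through estimator
        return z_q, indices, loss

# The same codebook instance receives features from both modalities, so both
# index the same discrete units (e.g., visual objects and spoken words).
codebook = SharedCodebook()
visual_feat = torch.randn(8, 256)
speech_feat = torch.randn(8, 256)
v_q, v_idx, v_loss = codebook(visual_feat)
s_q, s_idx, s_loss = codebook(speech_feat)
```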

Cited by 19 publications (18 citation statements)
References 21 publications
“…Further, TS2-Net (Liu et al., 2022b) proposes a novel token shift and selection transformer architecture that adjusts the token sequence and selects informative tokens in both the temporal and spatial dimensions of the input video. Later, DiscreteCodebook (Liu et al., 2022a) proposes to align modalities in a space filled with concepts, which are randomly initialized and updated without supervision, while VCM proposes to construct a space of visual concepts clustered without supervision.…”
Section: Related Work
confidence: 99%
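The statement above summarizes DiscreteCodebook as aligning modalities in a space of randomly initialized, unsupervisedly updated concepts. The following is a rough sketch of one way such concept-space alignment could be set up, not necessarily the cited papers' exact formulation: each modality's features are soft-assigned to a shared concept bank, and paired video/text inputs are encouraged to induce similar assignment distributions. `ConceptAlignment`, `num_concepts`, and `temperature` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptAlignment(nn.Module):
    """Align two modalities through a shared, randomly initialized concept bank (sketch)."""

    def __init__(self, num_concepts: int = 256, dim: int = 512, temperature: float = 0.07):
        super().__init__()
        self.concepts = nn.Parameter(torch.randn(num_concepts, dim))
        self.temperature = temperature

    def assign(self, feats: torch.Tensor) -> torch.Tensor:
        # Soft-assign each feature to the concept bank via cosine similarity.
        feats = F.normalize(feats, dim=-1)
        concepts = F.normalize(self.concepts, dim=-1)
        return F.softmax(feats @ concepts.t() / self.temperature, dim=-1)

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # Paired video/text features should induce similar concept distributions.
        p_video = self.assign(video_feats)
        p_text = self.assign(text_feats)
        # Symmetric KL divergence between the two assignments as an alignment loss;
        # the concept bank is updated only by this objective (no concept labels).
        return 0.5 * (F.kl_div(p_video.log(), p_text, reduction="batchmean")
                      + F.kl_div(p_text.log(), p_video, reduction="batchmean"))
```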
“…In this part, we present a general video-text retrieval framework used by previous methods (Luo et al., 2022; Liu et al., 2022a). With this paradigm, we can obtain three representations for the different modalities from the original space, i.e., the frame representation r_f, the video representation r_v, and the sentence representation r_s, produced by modality-dependent encoders.…”
Section: General Video-Text Retrieval Paradigm
confidence: 99%
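The statement above outlines the general video-text retrieval paradigm: modality-dependent encoders yield frame representations r_f, a video representation r_v, and a sentence representation r_s, which are then compared for retrieval. Below is a minimal sketch of that paradigm with placeholder encoders; the linear layers and mean pooling are stand-ins, not the cited methods' actual architectures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextRetrieval(nn.Module):
    """General video-text retrieval paradigm (sketch with placeholder encoders)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.frame_encoder = nn.Linear(2048, dim)   # stand-in for a visual backbone
        self.text_encoder = nn.Linear(768, dim)     # stand-in for a sentence encoder

    def forward(self, frames: torch.Tensor, sentence: torch.Tensor):
        # frames:   (batch, num_frames, 2048) raw frame features
        # sentence: (batch, 768) raw sentence features
        r_f = self.frame_encoder(frames)            # frame representations r_f
        r_v = r_f.mean(dim=1)                       # video representation r_v (mean pooling)
        r_s = self.text_encoder(sentence)           # sentence representation r_s
        return r_f, r_v, r_s

    @staticmethod
    def similarity(r_v: torch.Tensor, r_s: torch.Tensor) -> torch.Tensor:
        # (num_videos, num_sentences) cosine-similarity matrix used for ranking.
        return F.normalize(r_v, dim=-1) @ F.normalize(r_s, dim=-1).t()

# Usage: rank sentences for each video (and vice versa) by the similarity matrix.
model = VideoTextRetrieval()
r_f, r_v, r_s = model(torch.randn(4, 12, 2048), torch.randn(4, 768))
scores = model.similarity(r_v, r_s)
```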