2021 · Preprint
DOI: 10.48550/arxiv.2106.05438

Cross-Modal Discrete Representation Learning

Cited by 5 publications (8 citation statements) · References: 0 publications
“…This is why research then focused on various techniques for a more reliable unification of different modalities. The most recent studies on cross-modality representation usually involve projecting each modality's input into the same semantic (or latent) space [16][17][18]. This projection can be done after preprocessing the raw inputs [18], and it appears preferable to project low- and mid-level features of the input so that similarity between the inputs is recognized more reliably.…”
Section: Multimodality
confidence: 99%
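
As a rough illustration of the shared-space idea described in this statement, the sketch below projects two pre-extracted modality features into one joint embedding via per-modality linear heads. The module name, feature dimensions, and data are hypothetical stand-ins, not the setup of any cited paper.

# Minimal sketch of projecting two modalities into a shared latent space.
# Dimensions and names are illustrative, not from the cited papers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceProjector(nn.Module):
    """Projects per-modality features into one joint embedding space."""
    def __init__(self, video_dim=512, text_dim=300, shared_dim=256):
        super().__init__()
        # One projection head per modality; both map into shared_dim.
        self.video_proj = nn.Linear(video_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, video_feats, text_feats):
        # L2-normalize so that dot products become cosine similarities.
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

# Usage: features assumed already extracted by modality-specific preprocessing.
model = SharedSpaceProjector()
v, t = model(torch.randn(4, 512), torch.randn(4, 300))
similarity = v @ t.T  # (4, 4) cross-modal similarity matrix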
“…The most recent studies on cross-modality representation usually involve projecting each modality's input into the same semantic (or latent) space [16][17][18]. This projection can be done after preprocessing the raw inputs [18], and it appears preferable to project low- and mid-level features of the input so that similarity between the inputs is recognized more reliably. However, the literature usually presents methods for generative tasks or for discriminative pairing tasks (e.g.…”
Section: Multimodality
confidence: 99%
“…Most existing video-text retrieval frameworks (Wang, Zhu, and Yang 2021; Portillo-Quintero, Ortiz-Bayliss, and Terashima-Marín 2021; Luo et al. 2021; Liu et al. 2021a; Chen et al. 2020; Mithun et al. 2018; Liu et al. 2019; Dzabraev et al. 2021; Lei et al. 2021) focus on constructing meaningful representations for video and text that capture the essential information of their respective modalities, such as motion information for video and the internal relevance of part-of-speech for text. These representations are embedded in a shared space and matched according to a similarity metric.…”
Section: Video-Text Retrieval
confidence: 99%
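
The matching step this statement describes can be sketched as a small ranking function: candidate video embeddings are scored against a text query by cosine similarity in the shared space. All names, sizes, and data below are illustrative assumptions, not a specific framework's API.

# Hypothetical video-text retrieval step: rank videos for one text query
# by cosine similarity in the shared embedding space.
import torch
import torch.nn.functional as F

def retrieve(text_emb, video_embs, k=5):
    """Return indices of the top-k videos for a single text query."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_embs = F.normalize(video_embs, dim=-1)
    scores = video_embs @ text_emb          # (num_videos,) cosine scores
    return torch.topk(scores, k).indices

# Example with random stand-in embeddings (100 candidate videos, 256-d).
videos = torch.randn(100, 256)
query = torch.randn(256)
print(retrieve(query, videos, k=5))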
“…There are a few works on unsupervised audio-visual representation learning. For example, audio and visual representations can be jointly learned through audio-visual synchronization [8,9,10], correspondence [11,12,13,14], and instance discrimination [15,16]. One can also use one modality as the target for learning another [17,18].…”
Section: Introduction
confidence: 99%
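
A common form of the instance-discrimination objective mentioned here is a symmetric InfoNCE loss over paired audio-visual clips, where each clip's audio and visual embeddings form the positive pair and the rest of the batch serves as negatives. The sketch below assumes embeddings are already computed; the function name and temperature value are illustrative, not taken from the cited works.

# Sketch of audio-visual instance discrimination with a symmetric
# InfoNCE loss; the only supervision assumed is the clip pairing itself.
import torch
import torch.nn.functional as F

def audio_visual_nce(audio_emb, visual_emb, temperature=0.07):
    """Contrastive loss: each (audio, visual) pair is a positive,
    all other clips in the batch act as negatives."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.T / temperature            # (B, B) similarity matrix
    targets = torch.arange(a.size(0))         # positives on the diagonal
    # Symmetric: audio-to-visual and visual-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = audio_visual_nce(torch.randn(8, 128), torch.randn(8, 128))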