2021 · Preprint
DOI: 10.48550/arxiv.2106.05438

Cross-Modal Discrete Representation Learning

Cited by 5 publications (8 citation statements) · References: 0 publications
“…This is why research then focused on various techniques for a more reliable unification of different modalities. The most recent studies on cross-modality representation usually involve projecting each modality's input into the same semantic (or latent) space [16][17][18]. This projection can be done after preprocessing the raw inputs [18], and it appears preferable to project low- and mid-level features of the input so that similarity between the inputs is recognized more reliably.…”
Section: Multimodality
confidence: 99%
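
As a rough illustration of the shared-space idea described in this statement, the sketch below projects two pre-extracted modality features into one joint embedding via per-modality linear heads. The module name, feature dimensions, and data are hypothetical stand-ins, not the setup of any cited paper.

# Minimal sketch of projecting two modalities into a shared latent space.
# Dimensions and names are illustrative, not from the cited papers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceProjector(nn.Module):
    """Projects per-modality features into one joint embedding space."""
    def __init__(self, video_dim=512, text_dim=300, shared_dim=256):
        super().__init__()
        # One projection head per modality; both map into shared_dim.
        self.video_proj = nn.Linear(video_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, video_feats, text_feats):
        # L2-normalize so that dot products become cosine similarities.
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

# Usage: features assumed already extracted by modality-specific preprocessing.
model = SharedSpaceProjector()
v, t = model(torch.randn(4, 512), torch.randn(4, 300))
similarity = v @ t.T  # (4, 4) cross-modal similarity matrix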
“…The most recent studies on cross-modality representation usually involve projecting each modality's input into the same semantic (or latent) space [16][17][18]. This projection can be done after preprocessing the raw inputs [18], and it appears preferable to project low- and mid-level features of the input so that similarity between the inputs is recognized more reliably. However, the literature usually presents methods for generative tasks or for discriminative pairing tasks (e.g.…”
Section: Multimodality
confidence: 99%
“…Most existing video-text retrieval frameworks (Wang, Zhu, and Yang 2021; Portillo-Quintero, Ortiz-Bayliss, and Terashima-Marín 2021; Luo et al. 2021; Liu et al. 2021a; Chen et al. 2020; Mithun et al. 2018; Liu et al. 2019; Dzabraev et al. 2021; Lei et al. 2021) focus on constructing meaningful representations for video and text that capture the essential information of their respective modalities, such as motion information for video and the internal relevance of part-of-speech for text. These representations are embedded in a shared space and matched according to a similarity metric.…”
Section: Video-Text Retrieval
confidence: 99%
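
The matching step this statement describes can be sketched as a small ranking function: candidate video embeddings are scored against a text query by cosine similarity in the shared space. All names, sizes, and data below are illustrative assumptions, not a specific framework's API.

# Hypothetical video-text retrieval step: rank videos for one text query
# by cosine similarity in the shared embedding space.
import torch
import torch.nn.functional as F

def retrieve(text_emb, video_embs, k=5):
    """Return indices of the top-k videos for a single text query."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_embs = F.normalize(video_embs, dim=-1)
    scores = video_embs @ text_emb          # (num_videos,) cosine scores
    return torch.topk(scores, k).indices

# Example with random stand-in embeddings (100 candidate videos, 256-d).
videos = torch.randn(100, 256)
query = torch.randn(256)
print(retrieve(query, videos, k=5))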
“…There are a few works on unsupervised audio-visual representation learning. For example, audio and visual representations can be jointly learned through audio-visual synchronization [8,9,10], correspondence [11,12,13,14], and instance discrimination [15,16]. One can also use one modality as the target for learning another [17,18].…”
Section: Introduction
confidence: 99%
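
A common form of the instance-discrimination objective mentioned here is a symmetric InfoNCE loss over paired audio-visual clips, where each clip's audio and visual embeddings form the positive pair and the rest of the batch serves as negatives. The sketch below assumes embeddings are already computed; the function name and temperature value are illustrative, not taken from the cited works.

# Sketch of audio-visual instance discrimination with a symmetric
# InfoNCE loss; the only supervision assumed is the clip pairing itself.
import torch
import torch.nn.functional as F

def audio_visual_nce(audio_emb, visual_emb, temperature=0.07):
    """Contrastive loss: each (audio, visual) pair is a positive,
    all other clips in the batch act as negatives."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.T / temperature            # (B, B) similarity matrix
    targets = torch.arange(a.size(0))         # positives on the diagonal
    # Symmetric: audio-to-visual and visual-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = audio_visual_nce(torch.randn(8, 128), torch.randn(8, 128))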