Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018
DOI: 10.18653/v1/p18-1209

Efficient Low-rank Multimodal Fusion With Modality-Specific Factors

Abstract: Multimodal research is an emerging field of artificial intelligence, and one of the main research problems in this field is multimodal fusion. The fusion of multimodal data is the process of integrating multiple unimodal representations into one compact multimodal representation. Previous research in this field has exploited the expressiveness of tensors for multimodal representation. However, these methods often suffer from an exponential increase in dimensions and in computational complexity introduced by the transformation of the input into a tensor.
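
To make the dimensionality problem the abstract refers to concrete, here is a minimal sketch (my own illustration, with assumed toy dimensions) that counts the weights of a full tensor-fusion layer: the outer product of three modality vectors of dimensions d_a, d_v, d_t, each augmented with a constant 1, has (d_a+1)(d_v+1)(d_t+1) entries, and a linear map from that tensor to an h-dimensional output multiplies the count by h.

```python
# Illustrative parameter count for full tensor fusion over three modalities.
def tensor_fusion_params(d_a: int, d_v: int, d_t: int, h: int) -> int:
    # Each unimodal vector is augmented with a constant 1 before the
    # outer product, so the fused tensor has (d + 1) entries per mode.
    tensor_size = (d_a + 1) * (d_v + 1) * (d_t + 1)
    return tensor_size * h  # weights of the linear layer applied to the tensor

# Example with modest 32/32/128 unimodal dims and a 64-dim output:
print(tensor_fusion_params(32, 32, 128, 64))  # 8,990,784 weights
```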

Cited by 522 publications (256 citation statements); references 31 publications. Selected citation statements follow, ordered by relevance.

“…For the task of image-text matching, Wang et al. (2017) compare an embedding network, which projects texts and photos into a joint space where semantically similar texts and photos are close to each other, with a similarity network, which fuses text embeddings and photo embeddings via element-wise multiplication. For the task of sentiment analysis, several models have been proposed for integrating visual, audio, and text signals on the CMU-MOSEI data set, namely contextual inter-modal attention (Ghosal et al., 2018), dynamic fusion graph, and low-rank multimodal fusion (Liu et al., 2018). There are also research initiatives in multimodal summarization (Li et al., 2017) and multimodal translation (Calixto et al., 2017; Delbrouck and Dupont, 2017).…”
Section: Introduction (citation type: mentioning; confidence: 99%)
“…The tensor fusion approach (TF) computes a tensor containing uni-modal, bi-modal, and tri-modal combination information. LMF (Liu et al., 2018) is a tensor fusion method that performs tensor factorization using the same rank for all modalities in order to reduce the number of parameters. Our proposed method instead uses different factors for each modality.…”
Section: Model Architecture (citation type: mentioning; confidence: 99%)
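
As context for this statement, below is a minimal NumPy sketch of the low-rank fusion idea it describes: rather than forming the full weight tensor, each modality keeps `rank` factor matrices, and the fused output is the rank-wise sum of the elementwise product of per-modality projections. The function name `lmf_fuse`, the toy dimensions, and the random factors are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def lmf_fuse(z, factors):
    """Low-rank fusion in the spirit of LMF (Liu et al., 2018).

    z       : list of unimodal feature vectors, one per modality
    factors : list of arrays; factors[m] has shape (rank, d_m + 1, h)
    """
    rank, _, h = factors[0].shape
    out = np.ones((rank, h))
    for z_m, W_m in zip(z, factors):
        z_aug = np.append(z_m, 1.0)            # augment with constant 1
        out *= np.einsum('d,rdh->rh', z_aug, W_m)  # per-modality projection
    return out.sum(axis=0)                     # shape (h,)

# Toy usage: audio/visual/text features fused into a 64-dim vector.
dims, rank, h = [32, 32, 128], 4, 64
z = [rng.standard_normal(d) for d in dims]
factors = [rng.standard_normal((rank, d + 1, h)) * 0.1 for d in dims]
print(lmf_fuse(z, factors).shape)              # (64,)
```

Note that the same `rank` is shared by all three modalities here, which is exactly the restriction the cited statement contrasts with its own modality-specific factors.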
“…Our method, Modality-based Redundancy Reduction multimodal Fusion (MRRF), builds on recent work in multimodal fusion: it first uses an outer-product tensor of the input modalities to better capture inter-modality dependencies, and then reduces the number of elements in the resulting tensor through low-rank factorization (Liu et al., 2018). Whereas the factorization used in (Liu et al., 2018) applies a single compression rate across all modalities, we instead use Tucker's tensor decomposition (see the Methodology section), which allows a different compression rate for each modality. This lets the model adapt to variations in the amount of useful information between modalities.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
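
The distinguishing feature named in this statement, a per-mode rank, is exactly what Tucker decomposition provides. The sketch below uses the tensorly library to decompose a toy 3-way tensor with a different rank per mode; the tensor shape and ranks are assumptions for illustration, not values from the cited papers.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

# A toy 3-way tensor standing in for a fused multimodal tensor
# (e.g., audio x visual x text). Dimensions are illustrative.
rng = np.random.default_rng(0)
X = tl.tensor(rng.standard_normal((8, 12, 20)))

# Tucker decomposition with a *different* rank per mode, so each
# modality gets its own compression rate.
core, factors = tucker(X, rank=[2, 4, 6])

print(core.shape)                  # (2, 4, 6)
print([f.shape for f in factors])  # [(8, 2), (12, 4), (20, 6)]

# Relative reconstruction error of the low-rank approximation
X_hat = tl.tucker_to_tensor((core, factors))
print(float(tl.norm(X - X_hat) / tl.norm(X)))
```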
“…The most common strategy for joint representation of features is concatenation. Despite its popularity, this strategy has been shown to fail to fully capture cross-modal interactions [14,15]. Consequently, several multimodal feature-representation strategies have been proposed for various applications [16,14,15,17].…”
Section: Related Work (citation type: mentioning; confidence: 99%)
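
To make the limitation mentioned in this statement concrete, here is a small sketch (toy vectors, my own illustration) contrasting concatenation with an outer-product representation: the concatenated vector contains no multiplicative cross-modal terms, while the outer product enumerates all of them at the cost of multiplicative growth in dimensionality.

```python
import numpy as np

a = np.array([1.0, 2.0])        # toy "audio" features
t = np.array([0.5, -1.0, 3.0])  # toy "text" features

# Concatenation keeps unimodal features side by side; any cross-modal
# interaction must be learned downstream.
concat = np.concatenate([a, t])   # shape (5,): no a_i * t_j terms

# The outer product represents every pairwise feature combination.
bimodal = np.outer(a, t)          # shape (2, 3): all a_i * t_j terms

print(concat.shape, bimodal.shape)
```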