2018
DOI: 10.48550/arxiv.1806.00064
Preprint

Efficient Low-rank Multimodal Fusion with Modality-Specific Factors

Cited by 51 publications (66 citation statements)
References 21 publications
“…Thus, how to aggregate information from multi-modal features is the main problem. Fusing multi-modal data or features across different modalities has long been a hot topic, and several classical methods [27]–[29] were proposed that use linear embeddings or attention mechanisms to fuse multi-modal features. For example, Hori et al. [28] propose a multi-modal attention model that selectively fuses multi-modal features based on learned attention.…”
Section: Multi-modal Fusion
confidence: 99%
“…The key to successful fusion is how to reinforce the discriminative information while suppressing the irrelevant information among multi-modal features. To this end, some works [27]–[29] propose using linear attention to selectively fuse multi-modal features. For example, Hori et al. [28] propose a multi-modal attention model that selectively fuses multi-modal features based on different attention factors.…”
Section: Introduction
confidence: 99%
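The learned-attention fusion described in these statements can be illustrated with a short sketch. The snippet below is a minimal, assumption-laden illustration rather than Hori et al.'s exact architecture: each modality feature is projected into a shared space, scored by a learned linear layer, and the softmax-normalized scores weight the fused representation. All class names, dimensions, and variables are placeholders.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Hypothetical sketch of attention-weighted multi-modal fusion."""

    def __init__(self, dims, fused_dim):
        super().__init__()
        # Project every modality into a shared space before scoring it.
        self.proj = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])
        self.score = nn.Linear(fused_dim, 1)

    def forward(self, feats):
        # feats: list of (batch, dim_m) tensors, one entry per modality.
        h = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)  # (B, M, F)
        alpha = torch.softmax(self.score(h), dim=1)                       # (B, M, 1)
        return (alpha * h).sum(dim=1)                                     # (B, F)

# Dummy audio/visual/text features with arbitrary example dimensions.
fusion = AttentionFusion(dims=[74, 35, 300], fused_dim=128)
fused = fusion([torch.randn(8, 74), torch.randn(8, 35), torch.randn(8, 300)])
print(fused.shape)  # torch.Size([8, 128])

In this form the attention weights are per-modality scalars; richer variants score per time step or per feature dimension instead.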
“…for images), which allows us to leverage wider modality-specific information, and b) often, but not always, each individual modality is in principle enough to correctly predict the output. A plethora of neural architectures have been proposed to learn multimodal representations for sentiment classification. Models often rely on a fusion mechanism (Khan et al. 2012), tensor factorisation (Liu et al. 2018; Zadeh et al. 2019), or complex attention mechanisms (Zadeh et al. 2018a) that is fed with modality-specific representations.…”
Section: Multimodal Fusion
confidence: 99%
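Tensor-based fusion, mentioned above alongside attention mechanisms, can be made concrete with a rough sketch in the spirit of TFN (Zadeh et al. 2017): each modality vector is padded with a constant 1 and the fused representation is the flattened outer product across modalities. The function name and dimensions below are illustrative assumptions, not the authors' code.

import torch

def tensor_fusion(z_a, z_v, z_t):
    # z_a, z_v, z_t: (batch, d_a), (batch, d_v), (batch, d_t) modality features.
    pad = lambda z: torch.cat([z, torch.ones(z.size(0), 1)], dim=1)  # append constant 1
    z_a, z_v, z_t = pad(z_a), pad(z_v), pad(z_t)
    # Outer product across the three modalities, then flatten per example.
    fused = torch.einsum('bi,bj,bk->bijk', z_a, z_v, z_t)
    return fused.flatten(start_dim=1)  # (batch, (d_a+1)*(d_v+1)*(d_t+1))

z = tensor_fusion(torch.randn(8, 32), torch.randn(8, 32), torch.randn(8, 64))
print(z.shape)  # torch.Size([8, 70785])

The fused size grows multiplicatively with the modality dimensions, which is exactly the cost that the low-rank factorisation of Liu et al. (2018) targets.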
“…feature-rich yet efficient representations (Zadeh et al. 2017; Liu et al. 2018; Hazarika, Zimmermann, and Poria 2020). Recently, Rahman et al. (2020) used pre-trained transformer-based models (Tsai et al. 2019; Siriwardhana et al. 2020) to achieve state-of-the-art results on the multimodal sentiment benchmarks MOSI (Wöllmer et al. 2013) and MOSEI (Zadeh et al. 2018c).…”
Section: Introduction
confidence: 99%
“…TFN conducts numerous dot-product operations in feature space, resulting in an increase in computation. Therefore, Liu et al. [5] proposed Low-rank Multimodal Fusion (LMF) based on TFN and improved computational efficiency by decomposing the high-order tensors. Apart from manipulating geometric properties, auxiliary losses are used to aid modal fusion.…”
Section: Introduction
confidence: 99%
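The low-rank decomposition credited above to Liu et al. [5] can be sketched as follows. This is a simplified reading of LMF rather than the authors' released code: each modality is projected by rank-r modality-specific factors, the projections are combined by elementwise product, and a weighted sum over the rank dimension yields the fused vector without ever materializing the full outer-product tensor. Names, shapes, and initializations are my assumptions.

import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Hypothetical sketch of low-rank multimodal fusion with modality-specific factors."""

    def __init__(self, dims, out_dim, rank):
        super().__init__()
        # One low-rank factor per modality, shape (rank, dim_m + 1, out_dim);
        # the +1 mirrors the constant-1 feature appended in tensor fusion.
        self.factors = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(rank, d + 1, out_dim)) for d in dims]
        )
        self.rank_weights = nn.Parameter(0.01 * torch.randn(rank))
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, feats):
        batch = feats[0].size(0)
        fused = None
        for z, w in zip(feats, self.factors):
            z = torch.cat([z, torch.ones(batch, 1, device=z.device)], dim=1)
            proj = torch.einsum('bd,rdo->rbo', z, w)          # per-rank projection
            fused = proj if fused is None else fused * proj   # elementwise product across modalities
        # Weighted sum over the rank dimension replaces the full tensor contraction.
        return torch.einsum('r,rbo->bo', self.rank_weights, fused) + self.bias

lmf = LowRankFusion(dims=[74, 35, 300], out_dim=128, rank=4)
out = lmf([torch.randn(8, 74), torch.randn(8, 35), torch.randn(8, 300)])
print(out.shape)  # torch.Size([8, 128])

Because each factor acts on a single modality, the cost scales linearly with the number of modalities and the rank, rather than multiplicatively with the modality dimensions as in the full outer-product fusion.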