Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018
DOI: 10.18653/v1/p18-1209

Efficient Low-rank Multimodal Fusion With Modality-Specific Factors

Abstract: Multimodal research is an emerging field of artificial intelligence, and one of the main research problems in this field is multimodal fusion. The fusion of multimodal data is the process of integrating multiple unimodal representations into one compact multimodal representation. Previous research in this field has exploited the expressiveness of tensors for multimodal representation. However, these methods often suffer from an exponential increase in dimensions and in computational complexity introduced by the transformation of the input into a tensor.
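
To make the dimensionality problem the abstract refers to concrete, here is a minimal sketch (my own illustration, with assumed toy dimensions) that counts the weights of a full tensor-fusion layer: the outer product of three modality vectors of dimensions d_a, d_v, d_t, each augmented with a constant 1, has (d_a+1)(d_v+1)(d_t+1) entries, and a linear map from that tensor to an h-dimensional output multiplies the count by h.

```python
# Illustrative parameter count for full tensor fusion over three modalities.
def tensor_fusion_params(d_a: int, d_v: int, d_t: int, h: int) -> int:
    # Each unimodal vector is augmented with a constant 1 before the
    # outer product, so the fused tensor has (d + 1) entries per mode.
    tensor_size = (d_a + 1) * (d_v + 1) * (d_t + 1)
    return tensor_size * h  # weights of the linear layer applied to the tensor

# Example with modest 32/32/128 unimodal dims and a 64-dim output:
print(tensor_fusion_params(32, 32, 128, 64))  # 8,990,784 weights
```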

Cited by 522 publications (256 citation statements); references 31 publications. Selected citation statements follow, ordered by relevance.

“…For the task of image-text matching, Wang et al. (2017) compare an embedding network, which projects texts and photos into a joint space where semantically similar texts and photos are close to each other, with a similarity network, which fuses text embeddings and photo embeddings via element-wise multiplication. For the task of sentiment analysis, several models have been proposed for integrating visual, audio, and text signals on the CMU-MOSEI data set, namely contextual inter-modal attention (Ghosal et al., 2018), dynamic fusion graph, and low-rank multimodal fusion (Liu et al., 2018). There are also research initiatives in multimodal summarization (Li et al., 2017) and multimodal translation (Calixto et al., 2017; Delbrouck and Dupont, 2017).…”
Section: Introduction (citation type: mentioning; confidence: 99%)
“…The tensor fusion approach (TF) computes a tensor containing uni-modal, bi-modal, and tri-modal combination information. LMF (Liu et al., 2018) is a tensor fusion method that performs tensor factorization using the same rank for all modalities in order to reduce the number of parameters. Our proposed method instead uses different factors for each modality.…”
Section: Model Architecture (citation type: mentioning; confidence: 99%)
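
As context for this statement, below is a minimal NumPy sketch of the low-rank fusion idea it describes: rather than forming the full weight tensor, each modality keeps `rank` factor matrices, and the fused output is the rank-wise sum of the elementwise product of per-modality projections. The function name `lmf_fuse`, the toy dimensions, and the random factors are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def lmf_fuse(z, factors):
    """Low-rank fusion in the spirit of LMF (Liu et al., 2018).

    z       : list of unimodal feature vectors, one per modality
    factors : list of arrays; factors[m] has shape (rank, d_m + 1, h)
    """
    rank, _, h = factors[0].shape
    out = np.ones((rank, h))
    for z_m, W_m in zip(z, factors):
        z_aug = np.append(z_m, 1.0)            # augment with constant 1
        out *= np.einsum('d,rdh->rh', z_aug, W_m)  # per-modality projection
    return out.sum(axis=0)                     # shape (h,)

# Toy usage: audio/visual/text features fused into a 64-dim vector.
dims, rank, h = [32, 32, 128], 4, 64
z = [rng.standard_normal(d) for d in dims]
factors = [rng.standard_normal((rank, d + 1, h)) * 0.1 for d in dims]
print(lmf_fuse(z, factors).shape)              # (64,)
```

Note that the same `rank` is shared by all three modalities here, which is exactly the restriction the cited statement contrasts with its own modality-specific factors.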
“…Our method, Modality-based Redundancy Reduction multimodal Fusion (MRRF), builds on recent work in multimodal fusion: it first uses an outer-product tensor of the input modalities to better capture inter-modality dependencies, and then reduces the number of elements in the resulting tensor through low-rank factorization (Liu et al., 2018). Whereas the factorization used in (Liu et al., 2018) applies a single compression rate across all modalities, we instead use Tucker's tensor decomposition (see the Methodology section), which allows a different compression rate for each modality. This lets the model adapt to variations in the amount of useful information between modalities.…”
Section: Introduction (citation type: mentioning; confidence: 99%)
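
The distinguishing feature named in this statement, a per-mode rank, is exactly what Tucker decomposition provides. The sketch below uses the tensorly library to decompose a toy 3-way tensor with a different rank per mode; the tensor shape and ranks are assumptions for illustration, not values from the cited papers.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

# A toy 3-way tensor standing in for a fused multimodal tensor
# (e.g., audio x visual x text). Dimensions are illustrative.
rng = np.random.default_rng(0)
X = tl.tensor(rng.standard_normal((8, 12, 20)))

# Tucker decomposition with a *different* rank per mode, so each
# modality gets its own compression rate.
core, factors = tucker(X, rank=[2, 4, 6])

print(core.shape)                  # (2, 4, 6)
print([f.shape for f in factors])  # [(8, 2), (12, 4), (20, 6)]

# Relative reconstruction error of the low-rank approximation
X_hat = tl.tucker_to_tensor((core, factors))
print(float(tl.norm(X - X_hat) / tl.norm(X)))
```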
“…The most common strategy for joint representation of features is concatenation. Despite its popularity, this strategy has been shown to fail to fully capture cross-modal interactions [14,15]. Consequently, several multimodal feature-representation strategies have been proposed for various applications [16,14,15,17].…”
Section: Related Work (citation type: mentioning; confidence: 99%)
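
To make the limitation mentioned in this statement concrete, here is a small sketch (toy vectors, my own illustration) contrasting concatenation with an outer-product representation: the concatenated vector contains no multiplicative cross-modal terms, while the outer product enumerates all of them at the cost of multiplicative growth in dimensionality.

```python
import numpy as np

a = np.array([1.0, 2.0])        # toy "audio" features
t = np.array([0.5, -1.0, 3.0])  # toy "text" features

# Concatenation keeps unimodal features side by side; any cross-modal
# interaction must be learned downstream.
concat = np.concatenate([a, t])   # shape (5,): no a_i * t_j terms

# The outer product represents every pairwise feature combination.
bimodal = np.outer(a, t)          # shape (2, 3): all a_i * t_j terms

print(concat.shape, bimodal.shape)
```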