MFAS: Multimodal Fusion Architecture Search

Pérez-Rúa, Juan-Manuel; Vielzeuf, Valentin; Pateux, Stéphane; Baccouche, Moez; Jurie, Frédéric

doi:10.1109/cvpr.2019.00713

Cited by 151 publications

(123 citation statements)

References 37 publications

Supporting

Mentioning

123

Contrasting


“…• For weighted sum with scalar weights, an iterative method is proposed [125] that requires the pre-trained vector representations for each modality to have the same number of elements arranged in an order that is suitable for element-wise addition. This is often achieved by jointly training a fully connected layer for dimension control and reordering for each modality, together with the scalar weights for fusion.…”

Section: A Simple Operation-based Fusion (mentioning)

confidence: 99%

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

Zhang, Yang, et al. 2020

IEEE J. Sel. Top. Signal Process.

262


Deep learning has revolutionized speech recognition, image recognition, and natural language processing since 2010, each involving a single modality in the input signal. However, many applications in artificial intelligence involve more than one modality. It is therefore of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, a technical review of the models and learning methods for multimodal intelligence is provided. The main focus is the combination of vision and natural language, which has become an important area in both computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent work on multimodal deep learning from three new angles: learning multimodal representations, the fusion of multimodal signals at various levels, and multimodal applications. On multimodal representation learning, we review the key concept of embedding, which unifies the multimodal signals into the same vector space and thus enables cross-modality signal processing. We also review the properties of the many types of embedding constructed and learned for general downstream tasks. On multimodal fusion, this review focuses on special architectures for the integration of the representation of unimodal signals for a particular task. On applications, selected areas of a broad interest in current literature are covered, including caption generation, text-to-image generation, and visual question answering. We believe this review can facilitate future studies in the emerging field of multimodal intelligence for the community.
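The scalar-weighted sum fusion described in the citation statement above (a fully connected layer per modality for dimension control, followed by element-wise addition with scalar weights) can be sketched as follows. This is a minimal illustration only, not the method of [125] or of the review: the feature sizes, the shared dimension, the weight values, and the function name `fuse_weighted_sum` are assumptions, and the projection matrices are random here rather than jointly trained with the weights.

```python
import numpy as np

def fuse_weighted_sum(features, projections, weights):
    """Project each modality to a shared size, then take a scalar-weighted sum.

    features    : list of 1-D arrays, one per modality (sizes may differ)
    projections : list of matrices mapping each modality to the shared size
                  (stand-ins for the per-modality fully connected layers)
    weights     : list of scalars, one per modality
    """
    projected = [W @ x for W, x in zip(projections, features)]
    # Element-wise addition is only valid because every projected vector
    # now has the same number of elements.
    return sum(w * p for w, p in zip(weights, projected))

# Hypothetical sizes: a 512-d image feature and a 300-d text feature.
rng = np.random.default_rng(0)
image_feat = rng.normal(size=512)
text_feat = rng.normal(size=300)
shared_dim = 128

# Random projections; in practice these would be learned together
# with the scalar fusion weights.
projections = [
    rng.normal(size=(shared_dim, 512)) * 0.01,
    rng.normal(size=(shared_dim, 300)) * 0.01,
]
weights = [0.6, 0.4]  # assumed values, for illustration only

fused = fuse_weighted_sum([image_feat, text_feat], projections, weights)
print(fused.shape)
```

The projection step is what makes the sum well defined: without it, the two pre-trained vectors would have different lengths and no meaningful element-wise correspondence.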


“…Recently, one-shot NAS methods have been proposed to eliminate the meta-controller by modeling the NAS problem as a single training process of an over-parameterized supernet that comprises all candidate paths [5,7,32,52]. The most closely related study to our work is the MFAS approach [39], which also incorporates NAS to search the optimal architecture for multimodal tasks. However, MFAS focuses on a simpler problem to search for the multimodal fusion model given two input features, which cannot be directly used to address the multimodal learning tasks in this paper.…”

Section: Related Work (mentioning)

confidence: 99%

Deep Multimodal Neural Architecture Search

Cui et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia


Designing effective neural networks is fundamentally important in deep multimodal learning. Most existing works focus on a single task and design neural architectures manually, which are highly task-specific and hard to generalize to different tasks. In this paper, we devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks. Given multimodal input, we first define a set of primitive operations, and then construct a deep encoder-decoder based unified backbone, where each encoder or decoder block corresponds to an operation searched from a predefined operation pool. On top of the unified backbone, we attach task-specific heads to tackle different multimodal learning tasks. By using a gradient-based NAS algorithm, the optimal architectures for different tasks are learned efficiently. Extensive ablation studies, comprehensive analysis, and comparative experimental results show that the obtained MMnasNet significantly outperforms existing state-of-the-art approaches across three multimodal learning tasks (over five datasets), including visual question answering, image-text matching, and visual grounding. CCS Concepts: Computing methodologies → Multi-task learning; Neural networks.
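The abstract above describes blocks whose operation is searched from a predefined pool with a gradient-based NAS algorithm. One common way to make such a choice differentiable is a softmax relaxation over the candidates, as in DARTS-style methods; the abstract does not say that MMnas uses exactly this form, so the sketch below is an assumption. The operation pool, the `alpha` values, and the function names are hypothetical, and no gradient step is shown.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical operation pool; a real search space would hold
# attention or feed-forward blocks rather than these toy functions.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "relu": lambda x: np.maximum(x, 0.0),
    "tanh": np.tanh,
}

def mixed_op(x, alpha):
    """Continuous relaxation: softmax-weighted sum of all candidate outputs."""
    probs = softmax(alpha)
    return sum(p * op(x) for p, op in zip(probs, CANDIDATE_OPS.values()))

def discretize(alpha):
    """After search, keep the candidate with the largest architecture weight."""
    return list(CANDIDATE_OPS)[int(np.argmax(alpha))]

x = np.array([-1.0, 0.5, 2.0])
alpha = np.array([0.1, 2.0, -1.0])  # assumed architecture parameters

y = mixed_op(x, alpha)       # used while alpha is being optimized
chosen = discretize(alpha)   # used to fix the final architecture
print(chosen)
```

During search, `alpha` would be updated by gradient descent alongside the network weights; only the discretized choice is kept in the final network.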


“…Vielzeuf et al proposed CentralNet, which converges different modality features step by step by using several levels of interim features available in each modality network [35]. There was also an attempt to use reinforcement learning-based AutoML to find the optimal fusion network architecture [38]. AutoML is effective in finding the optimal combination of hyper-parameters from each network layer and the layer from which the features of each modality are extracted.…”

Section: Multimodal Deep Learning (mentioning)

confidence: 99%