Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413977
Deep Multimodal Neural Architecture Search

Abstract: Designing effective neural networks is fundamentally important in deep multimodal learning. Most existing works focus on a single task and design neural architectures manually, which makes them highly task-specific and hard to generalize to different tasks. In this paper, we devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks. Given multimodal input, we first define a set of primitive operations, and then construct a deep encoder-decoder based unified…
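The abstract's core idea, choosing among a set of primitive operations to assemble a task-specific architecture, can be illustrated with a toy search loop. This is only a minimal sketch: the operation names, the identity of the primitives, and the scoring proxy are invented for illustration and are not the actual MMnas primitives or search objective.

```python
import itertools

# Hypothetical primitive operations on a feature vector.
# Names (SA/FFN/GA) echo common attention-style modules but the
# implementations here are placeholder arithmetic, not real layers.
def self_att(x):   return [v * 1.1 for v in x]
def feed_fwd(x):   return [v + 0.5 for v in x]
def guided_att(x): return [v * 0.9 + 0.2 for v in x]

PRIMITIVES = {"SA": self_att, "FFN": feed_fwd, "GA": guided_att}

def evaluate(arch, x):
    """Toy proxy score: apply the chosen ops in order and sum the output.
    A real NAS framework would use validation accuracy instead."""
    for name in arch:
        x = PRIMITIVES[name](x)
    return sum(x)

def search(depth=2, x=(1.0, 2.0)):
    """Exhaustively score every operation sequence of the given depth
    and return the best one -- a stand-in for the learned search that
    a framework like MMnas performs over a much larger space."""
    return max(itertools.product(PRIMITIVES, repeat=depth),
               key=lambda arch: evaluate(arch, list(x)))

print(search())
```

The exhaustive loop is tractable only for toy spaces; practical NAS methods replace it with gradient-based or sampling-based search over a shared supernet.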

Cited by 63 publications (24 citation statements) · References 43 publications (72 reference statements)
“…(1) The framework places no special requirements on the input modalities, so we can choose inputs from commonly used modalities such as image, spectrogram, and skeleton, according to our needs. (2) Compared with previous works (Pérez-Rúa et al. 2019; Yu et al. 2020), which aim to search for entire network architectures, our fusion structures are relatively independent modules. Thus, our SepFusion also has good compatibility and can easily be plugged into existing pipelines.…”
Section: Skeleton-Guided Sound Separation / Stereo Generation (mentioning, confidence: 99%)
“…Both the appearance and motion modalities belong to the vision domain, while substantially larger gaps may appear between cross-sensor modalities. Some works try to solve the cross-sensor modality fusion problem by finding ideal feature connections in VQA (Gao et al. 2019; Yu et al. 2020) and audio-visual classification tasks (Pérez-Rúa et al. 2019). These previous works mainly focus on high-level scenes, but our work can handle fine-grained dense prediction tasks at the pixel level.…”
Section: Related Work (mentioning, confidence: 99%)
“…However, in RGB-D salient object detection, the multimodal feature fusion architectures are still designed by hand. Although there are several NAS works [46,57] for multimodal fusion, they are designed specifically for the visual question answering task [57] or the image-audio fusion task [46]. As far as we know, our work is the first attempt to use NAS algorithms to tackle the multimodal multi-scale feature fusion problem for RGB-D SOD.…”
Section: Neural Architecture Search (mentioning, confidence: 99%)
“…Compared to earlier methods that are adapted to only one V+L task [46,47,49], VLP models are generalizable across multiple tasks and also achieve significantly better performance on the respective tasks. Learning fine-grained semantic alignments between image regions and text words plays a key role in V+L tasks.…”
Section: Introduction (mentioning, confidence: 99%)