2022
DOI: 10.48550/arxiv.2204.01678
Preprint

MultiMAE: Multi-modal Multi-task Masked Autoencoders

Cited by 14 publications (15 citation statements)
References 0 publications
“…It would also be interesting to study introducing auxiliary prediction for other modalities, such as audio. Another weakness is that our model operates only on RGB pixels from a single camera viewpoint; we look forward to a future work that incorporates different input modalities such as proprioceptive states and point clouds, building on top of the recent multi-modal learning approaches [52,53]. Finally, our approach trains behaviors from scratch, which makes it still too sample-inefficient to be used in real-world scenarios.…”
Section: Discussion
confidence: 99%
“…In computer vision, a popular method for MTL is to employ a single encoder to learn a shared representation, followed by numerous task-specific decoders [14,15]. In this paper, a similar strategy is employed by training one main backbone model together with several small task-specific heads.…”
Section: Multi-task Learning and Self-training
confidence: 99%
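The shared-encoder/multi-head pattern this statement describes can be sketched in a few lines. The sketch below is a minimal PyTorch illustration with hypothetical task names (depth, semantic segmentation), not the cited papers' actual architectures.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """One shared backbone encoder feeding several small task-specific heads."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # Shared representation learner (a stand-in for a ViT/ResNet backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Lightweight task-specific heads; names and output sizes are illustrative.
        self.heads = nn.ModuleDict({
            "depth": nn.Conv2d(feat_dim, 1, kernel_size=1),
            "semseg": nn.Conv2d(feat_dim, 21, kernel_size=1),
        })

    def forward(self, rgb):
        shared = self.encoder(rgb)  # one shared representation for all tasks
        return {name: head(shared) for name, head in self.heads.items()}

model = MultiTaskModel()
outputs = model(torch.randn(2, 3, 64, 64))
print({k: v.shape for k, v in outputs.items()})
```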
“…Pseudo labeling is a one-time preprocessing method applicable to RGB datasets of variable size. Compared to the training cost, this phase is computationally inexpensive [14].…”
Section: Pseudo-labeled Multi-task Training
confidence: 99%
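A minimal sketch of the one-time pseudo-labeling pass the statement refers to: a frozen, off-the-shelf teacher network is run once over the RGB dataset and its predictions are cached as pseudo labels for later multi-task training. The teacher, loader, and file layout here are assumptions for illustration.

```python
import os
import torch

@torch.no_grad()
def pseudo_label_dataset(rgb_loader, teacher, out_dir):
    """Run a frozen teacher once over all RGB images and cache its predictions.

    This is a single preprocessing pass, separate from (and much cheaper than)
    the subsequent multi-task training run.
    """
    os.makedirs(out_dir, exist_ok=True)
    teacher.eval()
    for idx, rgb in enumerate(rgb_loader):
        pred = teacher(rgb)  # e.g. a depth or segmentation map
        torch.save(pred.cpu(), os.path.join(out_dir, f"{idx:06d}.pt"))
```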
“…GMAE [36] adapts MAE to the domain of graphs. MultiMAE [37] enhances the flexibility of MAE by enabling it to take optional inputs from different modalities and correspondingly adds further training objectives to facilitate multi-modal learning. However, these works fail to handle temporal and multi-spectral input.…”
Section: Introduction
confidence: 99%
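The statement summarizes MultiMAE's design at a high level: tokens from whichever modalities happen to be present are masked, encoded jointly, and reconstructed by per-modality decoders, one loss term per modality. The sketch below is a heavily simplified, hypothetical rendering of that idea (it reconstructs only visible tokens, in patch space), not the released MultiMAE code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMultiModalMAE(nn.Module):
    """Simplified masked autoencoder that accepts an optional subset of modalities."""

    def __init__(self, dim=128, patch=16):
        super().__init__()
        self.patch = patch
        self.chans = {"rgb": 3, "depth": 1}  # illustrative modalities only
        self.embed = nn.ModuleDict({
            m: nn.Conv2d(c, dim, kernel_size=patch, stride=patch)
            for m, c in self.chans.items()
        })
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # One lightweight decoder (and hence one reconstruction loss) per modality.
        self.decoders = nn.ModuleDict({
            m: nn.Linear(dim, c * patch * patch) for m, c in self.chans.items()
        })

    def forward(self, inputs, mask_ratio=0.5):
        visible, targets, spans, start = [], {}, {}, 0
        for m, x in inputs.items():  # iterate only over the modalities provided
            patches = F.unfold(x, self.patch, stride=self.patch).transpose(1, 2)
            tok = self.embed[m](x).flatten(2).transpose(1, 2)  # (B, N, dim)
            keep = torch.randperm(tok.shape[1])[: int(tok.shape[1] * (1 - mask_ratio))]
            visible.append(tok[:, keep])
            targets[m] = patches[:, keep]  # simplification: reconstruct visible patches
            spans[m] = (start, start + keep.numel())
            start += keep.numel()
        z = self.encoder(torch.cat(visible, dim=1))  # joint encoding of all modalities
        return {m: F.mse_loss(self.decoders[m](z[:, a:b]), targets[m])
                for m, (a, b) in spans.items()}

model = TinyMultiModalMAE()
losses = model({"rgb": torch.randn(2, 3, 64, 64)})  # depth input is optional
print(losses)
```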