MultiMAE: Multi-modal Multi-task Masked Autoencoders
2022
DOI: 10.1007/978-3-031-19836-6_20

Cited by 61 publications (26 citation statements). References 46 publications.
Citation statements (ordered by relevance):
“…Specifically, during pretraining, we partition the training images into patches and feed a portion of them into the encoder following Masked Autoencoder [14]. Our GeoMIM decoder then uses these encoded visible tokens to reconstruct the pretrained LiDAR model's BEV feature in the BEV space, instead of the commonly used RGB pixels [47,14,30] or depth points [3] as in existing MAE frameworks. To achieve this PV-to-BEV reconstruction, we first devise two branches to decouple the semantic and geometric parts, with one branch completing dense PV features and the other reconstructing the depth map.…”
Section: Pretrain Supervision Finetune (mentioning)
confidence: 99%
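For reference, below is a minimal PyTorch sketch of the MAE-style step this quote describes: partition an image into patches and feed only a random subset of visible tokens to the encoder. The patch size, embedding width, mask ratio, and encoder depth are illustrative assumptions, not GeoMIM's actual configuration, and the BEV-feature decoder the quote describes is omitted.

```python
import torch
import torch.nn as nn

patch_size, embed_dim, mask_ratio = 16, 192, 0.75   # assumed toy sizes
patchify = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
    num_layers=2,
)

images = torch.randn(8, 3, 224, 224)                   # toy batch
tokens = patchify(images).flatten(2).transpose(1, 2)   # (B, N, D) = (8, 196, 192)
B, N, _ = tokens.shape

# Keep a random subset of patches per image; the rest are masked out.
num_visible = int(N * (1 - mask_ratio))                # 49 visible patches here
perm = torch.rand(B, N).argsort(dim=1)                 # random order per image
visible_idx = perm[:, :num_visible]
visible = torch.gather(
    tokens, 1, visible_idx.unsqueeze(-1).expand(-1, -1, embed_dim)
)

latent = encoder(visible)   # the encoder sees only the visible tokens
```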
“…Masked Image Modeling: Inspired by BERT [11] for Masked Language Modeling, Masked Image Modeling (MIM) has become a popular pretext task for visual representation learning [6,14,2,46,1,4,51,3,49]. MIM aims to reconstruct the masked tokens from a corrupted input.…”
Section: Related Work (mentioning)
confidence: 99%
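The MIM objective the quote summarizes is simple: predict the masked content and score the prediction only at masked positions. A hedged sketch, using a linear layer as a stand-in for a real decoder and random tensors as toy targets:

```python
import torch
import torch.nn as nn

B, N, D = 8, 196, 192
patch_dim = 16 * 16 * 3                    # pixels in one 16x16 RGB patch
decoder = nn.Linear(D, patch_dim)          # stand-in for a real decoder

decoded = decoder(torch.randn(B, N, D))    # predicted patches (toy latents)
target = torch.randn(B, N, patch_dim)      # ground-truth pixel patches (toy)
mask = torch.rand(B, N) < 0.75             # True where a patch was masked

# The reconstruction loss is computed only on the masked positions.
loss = ((decoded - target) ** 2).mean(dim=-1)[mask].mean()
loss.backward()
```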
“…However, such ViT pre-training methods focus on image or video data without exploring how 3D priors can potentially be exploited. MultiMAE [2], on the other hand, introduces depth priors; however, it requires depth as input not only during pre-training but also in downstream tasks.…”
Section: Related Work (mentioning)
confidence: 99%
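To make the "requires depth as input" point concrete, here is a rough sketch of the multi-modal idea the quote attributes to MultiMAE: each modality is patchified by its own projection and the token sequences are concatenated for a shared encoder, which is why a depth map must accompany the RGB image. All sizes are assumptions, not the paper's configuration, and masking is omitted for brevity.

```python
import torch
import torch.nn as nn

embed_dim = 192
rgb_patch = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
depth_patch = nn.Conv2d(1, embed_dim, kernel_size=16, stride=16)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
    num_layers=2,
)

rgb = torch.randn(2, 3, 224, 224)
depth = torch.randn(2, 1, 224, 224)        # the depth map is a required input
rgb_tok = rgb_patch(rgb).flatten(2).transpose(1, 2)      # (2, 196, 192)
dep_tok = depth_patch(depth).flatten(2).transpose(1, 2)  # (2, 196, 192)

# One shared encoder over the concatenated multi-modal token sequence.
latent = encoder(torch.cat([rgb_tok, dep_tok], dim=1))   # (2, 392, 192)
```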
“…It masks random patches of the input image and restores the missing pixels. MAE has been used in many vision tasks [1,34,41]. Motivated by its powerful and robust data-generation ability, we leverage MAE for the first time to detect triggers and restore images.…”
Section: Related Work (mentioning)
confidence: 99%
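The quote does not spell out how restoration enables trigger detection; one plausible reading, sketched below under that assumption, is to flag patches whose MAE-restored pixels deviate strongly from the input. The error metric and threshold rule here are hypothetical, not the cited paper's method.

```python
import torch

B, N, patch_dim = 2, 196, 768
restored = torch.randn(B, N, patch_dim)    # MAE-restored pixel patches (toy)
original = torch.randn(B, N, patch_dim)    # original pixel patches (toy)

# Per-patch reconstruction error; poorly restored patches are suspects.
per_patch_err = ((restored - original) ** 2).mean(dim=-1)    # (B, N)
threshold = per_patch_err.mean() + 2 * per_patch_err.std()   # hypothetical rule
suspect = per_patch_err > threshold        # candidate trigger locations
print(int(suspect.sum()), "patches flagged")
```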