MultiMAE: Multi-modal Multi-task Masked Autoencoders
2022
DOI: 10.1007/978-3-031-19836-6_20

Cited by 61 publications (26 citation statements). References 46 publications.
Citation statements (ordered by relevance):
“…Specifically, during pretraining, we partition the training images into patches and feed a portion of them into the encoder following Masked Autoencoder [14]. Our GeoMIM decoder then uses these encoded visible tokens to reconstruct the pretrained LiDAR model's BEV feature in the BEV space, instead of the commonly used RGB pixels [47,14,30] or depth points [3] as in existing MAE frameworks. To achieve this PV-to-BEV reconstruction, we first devise two branches to decouple the semantic and geometric parts, with one branch completing dense PV features and the other reconstructing the depth map.…”
Section: Pretrain Supervision Finetune (mentioning)
confidence: 99%
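For reference, below is a minimal PyTorch sketch of the MAE-style step this quote describes: partition an image into patches and feed only a random subset of visible tokens to the encoder. The patch size, embedding width, mask ratio, and encoder depth are illustrative assumptions, not GeoMIM's actual configuration, and the BEV-feature decoder the quote describes is omitted.

```python
import torch
import torch.nn as nn

patch_size, embed_dim, mask_ratio = 16, 192, 0.75   # assumed toy sizes
patchify = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
    num_layers=2,
)

images = torch.randn(8, 3, 224, 224)                   # toy batch
tokens = patchify(images).flatten(2).transpose(1, 2)   # (B, N, D) = (8, 196, 192)
B, N, _ = tokens.shape

# Keep a random subset of patches per image; the rest are masked out.
num_visible = int(N * (1 - mask_ratio))                # 49 visible patches here
perm = torch.rand(B, N).argsort(dim=1)                 # random order per image
visible_idx = perm[:, :num_visible]
visible = torch.gather(
    tokens, 1, visible_idx.unsqueeze(-1).expand(-1, -1, embed_dim)
)

latent = encoder(visible)   # the encoder sees only the visible tokens
```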
“…Masked Image Modeling: Inspired by BERT [11] for Masked Language Modeling, Masked Image Modeling (MIM) has become a popular pretext task for visual representation learning [6,14,2,46,1,4,51,3,49]. MIM aims to reconstruct the masked tokens from a corrupted input.…”
Section: Related Work (mentioning)
confidence: 99%
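The MIM objective the quote summarizes is simple: predict the masked content and score the prediction only at masked positions. A hedged sketch, using a linear layer as a stand-in for a real decoder and random tensors as toy targets:

```python
import torch
import torch.nn as nn

B, N, D = 8, 196, 192
patch_dim = 16 * 16 * 3                    # pixels in one 16x16 RGB patch
decoder = nn.Linear(D, patch_dim)          # stand-in for a real decoder

decoded = decoder(torch.randn(B, N, D))    # predicted patches (toy latents)
target = torch.randn(B, N, patch_dim)      # ground-truth pixel patches (toy)
mask = torch.rand(B, N) < 0.75             # True where a patch was masked

# The reconstruction loss is computed only on the masked positions.
loss = ((decoded - target) ** 2).mean(dim=-1)[mask].mean()
loss.backward()
```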
“…However, such ViT pre-training methods focus on image or video data without exploring how 3D priors can potentially be exploited. MultiMAE [2], on the other hand, introduces depth priors; however, it requires depth as input not only during pre-training but also in downstream tasks.…”
Section: Related Work (mentioning)
confidence: 99%
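To make the "requires depth as input" point concrete, here is a rough sketch of the multi-modal idea the quote attributes to MultiMAE: each modality is patchified by its own projection and the token sequences are concatenated for a shared encoder, which is why a depth map must accompany the RGB image. All sizes are assumptions, not the paper's configuration, and masking is omitted for brevity.

```python
import torch
import torch.nn as nn

embed_dim = 192
rgb_patch = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
depth_patch = nn.Conv2d(1, embed_dim, kernel_size=16, stride=16)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
    num_layers=2,
)

rgb = torch.randn(2, 3, 224, 224)
depth = torch.randn(2, 1, 224, 224)        # the depth map is a required input
rgb_tok = rgb_patch(rgb).flatten(2).transpose(1, 2)      # (2, 196, 192)
dep_tok = depth_patch(depth).flatten(2).transpose(1, 2)  # (2, 196, 192)

# One shared encoder over the concatenated multi-modal token sequence.
latent = encoder(torch.cat([rgb_tok, dep_tok], dim=1))   # (2, 392, 192)
```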
“…It masks random patches of the input image and restores the missing pixels. MAE has been used in many vision tasks [1,34,41]. Motivated by its powerful and robust data-generation ability, we leverage MAE for the first time to detect triggers and restore images.…”
Section: Related Work (mentioning)
confidence: 99%
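The quote does not spell out how restoration enables trigger detection; one plausible reading, sketched below under that assumption, is to flag patches whose MAE-restored pixels deviate strongly from the input. The error metric and threshold rule here are hypothetical, not the cited paper's method.

```python
import torch

B, N, patch_dim = 2, 196, 768
restored = torch.randn(B, N, patch_dim)    # MAE-restored pixel patches (toy)
original = torch.randn(B, N, patch_dim)    # original pixel patches (toy)

# Per-patch reconstruction error; poorly restored patches are suspects.
per_patch_err = ((restored - original) ** 2).mean(dim=-1)    # (B, N)
threshold = per_patch_err.mean() + 2 * per_patch_err.std()   # hypothetical rule
suspect = per_patch_err > threshold        # candidate trigger locations
print(int(suspect.sum()), "patches flagged")
```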