“…Vision Transformer (ViT) [18,5] achieved state-of-the-art results on various vision tasks. To speed up convergence and improve accuracy, well-explored locality inductive biases have been reintroduced into vision transformers [66,22,62,41,27,61,51,19,56,26]; among these, hybrid architectures that combine convolution and transformer designs [49,57,12,21,34] can achieve state-of-the-art performance on a wide range of tasks. Our ConvMAE is strongly motivated by the hybrid architecture designs [21,34,12,57] in vision backbones.…”