“…Many recent works introduce convolutional layers into the ViT architecture to form hybrid networks that improve performance and sample efficiency while reducing parameter counts and FLOPs. Examples include the MobileViTs (MobileViTv1 (Mehta & Rastegari, 2021), MobileViTv2 (Mehta & Rastegari, 2022)), CMT (Guo et al., 2022), CvT (Wu et al., 2021), PVTv2, ResT, MobileFormer, CPVT (Chu et al., 2021), MiniViT, CoAtNet, and CoaT (Xu et al., 2021a). The performance of many of these models on ImageNet-1K, together with their parameter counts and FLOPs, is shown in Figure 1.…”