2021
DOI: 10.48550/arxiv.2105.02358
Preprint

Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks

Abstract: Attention mechanisms, especially self-attention, have played an increasingly important role in deep feature representation for visual tasks. Self-attention updates the feature at each position by computing a weighted sum of features using pair-wise affinities across all positions to capture the long-range dependency within a single sample. However, self-attention has quadratic complexity and ignores potential correlation between different samples. This paper proposes a novel attention mechanism which we call e…
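The abstract is truncated here. As a minimal sketch of the mechanism it describes (replacing pair-wise affinities within a sample by attention against two small learnable linear layers), the following PyTorch-style module is an assumed illustration, not the authors' released code; the memory size S, tensor shapes, and the double-normalization scheme are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExternalAttentionSketch(nn.Module):
    """Attention against two external linear layers (learnable memories).

    Each of the N positions attends to a small memory of S slots, so the
    cost is O(N * S) instead of the O(N^2) of pair-wise self-attention.
    """

    def __init__(self, d_model: int, mem_size: int = 64):
        super().__init__()
        self.mk = nn.Linear(d_model, mem_size, bias=False)  # "key" memory
        self.mv = nn.Linear(mem_size, d_model, bias=False)  # "value" memory

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_model) -- a flattened feature map
        attn = self.mk(x)                                      # (batch, N, S) affinities to memory slots
        attn = F.softmax(attn, dim=1)                          # normalize over the N positions
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)  # then l1-normalize over slots (assumed scheme)
        return self.mv(attn)                                   # (batch, N, d_model)


x = torch.randn(2, 196, 256)                  # e.g. a 14x14 feature map flattened to 196 positions
print(ExternalAttentionSketch(256)(x).shape)  # torch.Size([2, 196, 256])
```

This also hints at why the quadratic cost disappears: the affinity matrix is N × S with S fixed and small, rather than N × N.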

Cited by 42 publications (41 citation statements)
References 78 publications

Citation statements (ordered by relevance):
“…However, since ViT lacks an intrinsic inductive bias for modeling local visual structures, it instead learns this IB implicitly from large amounts of data. Follow-up works in this direction simplify the model structure to have fewer intrinsic IBs and learn them directly from large-scale data [42,63,64,18,15], which has achieved promising results and is being studied actively. Another direction is to leverage the intrinsic IB from CNNs to facilitate the training of vision transformers, e.g., using less training data or shorter training schedules.…”
Section: Vision Transformers with Learned IB (mentioning)
confidence: 99%
“…To avoid the drawbacks of the aforementioned learning architectures, and with the aim of achieving better results at lower computational cost, four architectures were very recently proposed almost simultaneously [16,7,12,17]. Their common aim is to take full advantage of linear layers.…”
Section: Four Recent Architectures (mentioning)
confidence: 99%
“…External attention [7] reveals the relation between self-attention and linear layers. It first simplifies self-attention as in Eq.…”
Section: External Attention (mentioning)
confidence: 99%
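The equation referenced in this excerpt is cut off above. As a sketch of the relation it points to, the simplified self-attention and its two-linear-layer counterpart can be written as follows; the notation (feature map F with N positions, external memory M with S entries, and the normalization Norm) is assumed here rather than taken from the truncated quote.

```latex
% Simplified self-attention: affinities are computed within the feature map F \in \mathbb{R}^{N \times d}
A = \operatorname{softmax}\!\left(F F^{\top}\right), \qquad F_{\mathrm{out}} = A F
% External attention: F attends to a small learnable external memory M \in \mathbb{R}^{S \times d},
% so the cost is linear in N rather than quadratic
A = \operatorname{Norm}\!\left(F M^{\top}\right), \qquad F_{\mathrm{out}} = A M
```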