2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01350
Coordinate Attention for Efficient Mobile Network Design

Cited by 2,149 publications (516 citation statements)
References 26 publications
“…First of all, outlook attention encodes spatial information by measuring the similarity between pairs of token representations, which is more parameter-efficient for feature learning than convolutions, as studied in previous work [37,45]. Second, outlook attention adopts a sliding-window mechanism to locally encode token representations at a fine level, and to some extent preserves the crucial positional information for vision tasks [25,56]. Third, the way of generating attention weights is simple and efficient.…”
Section: Discussion
confidence: 99%
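The mechanism this excerpt describes — attention weights produced directly by a linear layer rather than by query-key dot products, applied over sliding local windows — can be sketched in PyTorch. This is a hedged single-head reading, not the official VOLO implementation; the class name, the kernel size `k`, and the omission of multi-head splitting and scaling are simplifications of mine:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    """Single-head sketch: per-pixel attention weights come straight from a
    linear layer and reweight the k x k window of values around each pixel."""
    def __init__(self, dim, k=3):
        super().__init__()
        self.k = k
        self.v = nn.Linear(dim, dim, bias=False)    # value projection
        self.attn = nn.Linear(dim, k ** 4)          # (k*k) x (k*k) weights per pixel
        self.proj = nn.Linear(dim, dim)
        self.unfold = nn.Unfold(k, padding=k // 2)  # stride-1 sliding windows

    def forward(self, x):                            # x: (B, H, W, C)
        B, H, W, C = x.shape
        v = self.v(x).permute(0, 3, 1, 2)            # (B, C, H, W)
        v = self.unfold(v).reshape(B, C, self.k ** 2, H * W)
        v = v.permute(0, 3, 2, 1)                    # (B, HW, k*k, C)
        a = self.attn(x).reshape(B, H * W, self.k ** 2, self.k ** 2)
        a = a.softmax(dim=-1)                        # normalize within each window
        out = a @ v                                  # (B, HW, k*k, C)
        out = out.permute(0, 3, 2, 1).reshape(B, C * self.k ** 2, H * W)
        out = F.fold(out, (H, W), self.k, padding=self.k // 2)  # sum overlaps
        return self.proj(out.permute(0, 2, 3, 1))    # back to (B, H, W, C)
```

`F.fold` sums the overlapping window outputs, so each position accumulates the contributions projected onto it from all of its neighboring windows.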
“…BAM and CBAM adopt convolutions to capture local relations but fail to model long-range dependencies. To solve this problem, Hou et al. [130] proposed coordinate attention, a novel attention mechanism that embeds positional information into channel attention, so that the network can focus on large important regions at little computational cost.…”
Section: Coordinate Attention
confidence: 99%
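As a concrete illustration of the mechanism this excerpt summarizes, here is a minimal PyTorch sketch of coordinate attention following the description in Hou et al. [130]: pool along each spatial direction, encode the concatenated direction-aware maps with a shared 1×1 convolution, then split them back into per-direction attention weights. The names and reduction ratio are illustrative, and ReLU stands in for the h-swish nonlinearity used in the paper:

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of coordinate attention: positional information enters the
    channel attention via two direction-aware pooled feature maps."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)            # paper uses h-swish
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        x_h = x.mean(dim=3, keepdim=True)            # (B, C, H, 1): pool along width
        x_w = x.mean(dim=2, keepdim=True)            # (B, C, 1, W): pool along height
        y = torch.cat([x_h, x_w.transpose(2, 3)], dim=2)   # (B, C, H+W, 1)
        y = self.act(self.bn(self.conv1(y)))         # shared encoding
        y_h, y_w = torch.split(y, [H, W], dim=2)     # split back per direction
        a_h = torch.sigmoid(self.conv_h(y_h))                   # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.transpose(2, 3)))   # (B, C, 1, W)
        return x * a_h * a_w                         # broadcast both attentions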
“…[113-116]. Channel & spatial attention: predict channel and spatial attention masks separately (e.g., [6,117]) or generate a joint 3-D (channel × height × width) attention mask directly (e.g., [118-120]) and use it to select important features [6,10,13,14,50,101,117-130]. Spatial & temporal attention: …”
Section: Introduction
confidence: 99%
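For the first branch of this taxonomy (separate channel and spatial masks), a compact sketch in the spirit of CBAM, one of the works the excerpt cites, is given below; the class name, reduction ratio, and 7×7 spatial kernel are illustrative choices, not the cited implementation:

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """CBAM-style sketch: a channel mask, then a spatial mask, applied in turn."""
    def __init__(self, channels, reduction=16, k=7):
        super().__init__()
        self.mlp = nn.Sequential(                   # shared MLP over pooled descriptors
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, k, padding=k // 2)

    def forward(self, x):                            # x: (B, C, H, W)
        # channel mask from average- and max-pooled descriptors
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx)[:, :, None, None]
        # spatial mask from channel-pooled maps
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```

A joint 3-D mask, the taxonomy's second branch, would instead produce a single (C, H, W) weight tensor in one shot rather than factorizing it into the two stages above.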
“…It can extract important features by assigning weights to each channel but does not learn the importance of location information. Therefore, we embed the coordinate attention (CA) module [47], which can fully perceive position information, into CSAM. The CA module first aggregates features near key points in the image into a pair of key-point direction-aware feature maps of sizes $(C, H, 1)$ and $(C, 1, W)$:
$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$, $\quad z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$…”
Section: Channel Attention
confidence: 99%
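The two directional pooling operations in the reconstructed equations above are just means over one spatial axis each; a quick shape check (tensor sizes arbitrary):

```python
import torch

x = torch.randn(1, 64, 32, 48)         # (B, C, H, W)
z_h = x.mean(dim=3, keepdim=True)       # (1, 64, 32, 1): average over the width
z_w = x.mean(dim=2, keepdim=True)       # (1, 64, 1, 48): average over the height
print(z_h.shape, z_w.shape)             # the (C, H, 1) and (C, 1, W) maps
```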