Visual Attention Network
2022 · Preprint
DOI: 10.48550/arxiv.2202.09741

Cited by 63 publications (116 citation statements). References 0 publications.
“…[113]–[116] Channel & spatial attention: predict channel and spatial attention masks separately (e.g., [6, 117]) or generate a joint 3-D channel × height × width attention mask directly (e.g., [118]–[120]) and use it to select important features. [6, 10, 13, 14, 50, 101, 117–119, 121–130] Spatial & temporal attention…”
Section: Introduction (mentioning)
Confidence: 99%
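The separate channel/spatial mask design this statement summarizes can be made concrete with a minimal PyTorch sketch in the spirit of CBAM; the module names, reduction ratio, and kernel size below are illustrative assumptions, not the exact configuration of any cited work.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Predicts a per-channel mask from globally pooled features (SE/CBAM style)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.mlp(x.mean(dim=(2, 3)))          # (B, C) from global average pooling
        return torch.sigmoid(w).view(b, c, 1, 1)  # channel mask, broadcast over H, W

class SpatialAttention(nn.Module):
    """Predicts an (H, W) mask from channel-pooled features."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)  # (B, 2, H, W)
        return torch.sigmoid(self.conv(pooled))               # spatial mask

x = torch.randn(1, 64, 32, 32)
ca, sa = ChannelAttention(64), SpatialAttention()
y = x * ca(x)   # select important channels
y = y * sa(y)   # then select important locations
```

A joint 3-D mask, by contrast, would output a full (C, H, W) tensor in one shot rather than composing the two broadcast masks above.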
“…Concurrent works. We notice three concurrent works, including ConvNeXt [42], RepLKNet [14] and Visual Attention Network (VAN) [20]. All these works are motivated by large receptive fields and exploit convolutions with large or dilated kernels as the main building block.…”
Section: Convolutions (mentioning)
Confidence: 99%
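A hedged sketch of the shared design point this statement names, a large effective receptive field from depthwise convolutions, built either as one large kernel (the RepLKNet direction) or decomposed into a small depthwise conv followed by a dilated depthwise conv (the decomposition direction VAN takes); the specific kernel sizes and dilation below are illustrative, not the published configurations.

```python
import torch
import torch.nn as nn

def large_kernel_dw(channels, kernel_size=31):
    """One large depthwise kernel: direct large receptive field."""
    return nn.Conv2d(channels, channels, kernel_size,
                     padding=kernel_size // 2, groups=channels)

def decomposed_large_kernel(channels):
    """A similar receptive field from a small depthwise conv followed by a
    dilated depthwise conv, at far fewer parameters per channel."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 5, padding=2, groups=channels),
        nn.Conv2d(channels, channels, 7, padding=9, dilation=3, groups=channels),
    )

x = torch.randn(1, 32, 56, 56)
# Both variants preserve spatial resolution, so they are drop-in building blocks.
assert large_kernel_dw(32)(x).shape == decomposed_large_kernel(32)(x).shape
```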
“…After that, more attention mechanisms [38]–[41] were proposed, such as self-attention [42] and channel attention [38]. Nowadays, attention mechanisms have been applied in many visual tasks [37]–[41], [43], [44]. As for self-supervised monocular depth estimation, attention-based networks have been applied in [19]–[21].…”
Section: Attention Mechanism (mentioning)
Confidence: 99%
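For contrast with the channel attention sketched earlier, here is a minimal single-head spatial self-attention over flattened H×W tokens, the other mechanism this statement names; the class name and the single-head simplification are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Single-head self-attention over flattened H*W spatial tokens."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, N, C) with N = H*W
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, N, N): each pixel attends to all
        out = self.proj(attn.softmax(dim=-1) @ v)
        return out.transpose(1, 2).reshape(b, c, h, w)

y = SpatialSelfAttention(64)(torch.randn(1, 64, 16, 16))
```

The quadratic (N × N) attention map is what motivates the large-kernel convolutional alternatives discussed in the previous statement.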
“…For the encoder, we employ a visual attention network (VAN) [44] to extract multi-scale feature maps X_i^e, i = 1, 2, 3, 4. A VAN has four stages, where spatial adaptability and channel adaptability are efficiently implemented by the large kernel attention.…”
Section: B. VADepth Network Architecture (mentioning)
Confidence: 99%
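A toy sketch of the encoder usage this statement describes: four downsampling stages, each with a large-kernel-attention block (depthwise conv → dilated depthwise conv → 1×1 conv, multiplied back onto the input, following VAN's decomposition), returning one feature map per stage. Stage widths, strides, and the `TinyVANEncoder` name are illustrative assumptions, not the published VAN configuration.

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    """Large kernel attention: a decomposed large-kernel depthwise convolution
    produces an attention map that reweights the input features."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn  # spatial and channel adaptability from one attention map

class TinyVANEncoder(nn.Module):
    """Four stages; each downsamples by 2 and applies an LKA block."""
    def __init__(self, dims=(32, 64, 128, 256)):
        super().__init__()
        in_ch, self.stages = 3, nn.ModuleList()
        for d in dims:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_ch, d, 3, stride=2, padding=1), nn.GELU(), LKA(d)))
            in_ch = d

    def forward(self, x):
        feats = []                    # multi-scale maps X_i^e, i = 1..4
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats

feats = TinyVANEncoder()(torch.randn(1, 3, 256, 256))
# feature resolutions: 128, 64, 32, 16 — a pyramid a depth decoder can consume
```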