2023
DOI: 10.1007/s41095-023-0364-2

Visual attention network

Abstract: While originally designed for natural language processing tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision: (1) treating images as 1D sequences neglects their 2D structures; (2) the quadratic complexity is too expensive for high-resolution images; (3) it only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel linear …
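The abstract is truncated, but the citing papers below refer to the proposed mechanism as "large kernel attention". The following is a minimal, illustrative PyTorch sketch of a decomposed large-kernel attention block of that kind, assuming a roughly 21×21 receptive field split into a 5×5 depth-wise convolution, a 7×7 depth-wise dilated convolution (dilation 3), and a 1×1 point-wise convolution; the class and parameter names are mine, not taken from the authors' code.

```python
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Decomposed large-kernel attention (illustrative sketch, not the official code).

    A large receptive field is approximated by chaining a 5x5 depth-wise
    convolution, a 7x7 depth-wise convolution with dilation 3, and a 1x1
    point-wise convolution; the result gates the input element-wise.
    """
    def __init__(self, channels: int):
        super().__init__()
        # local spatial context (depth-wise)
        self.dw_conv = nn.Conv2d(channels, channels, kernel_size=5,
                                 padding=2, groups=channels)
        # long-range spatial context (depth-wise, dilated)
        self.dw_dilated = nn.Conv2d(channels, channels, kernel_size=7,
                                    padding=9, dilation=3, groups=channels)
        # channel mixing (point-wise), giving channel adaptability
        self.pw_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw_conv(self.dw_dilated(self.dw_conv(x)))
        return attn * x  # element-wise gating keeps the cost linear in H*W


if __name__ == "__main__":
    x = torch.randn(1, 64, 56, 56)
    print(LargeKernelAttention(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

Because the gating is produced by convolutions rather than a token-to-token similarity matrix, the cost grows linearly with the number of pixels while still combining local structure, long-range context, and channel mixing.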

Citations: cited by 196 publications (55 citation statements)
References: 106 publications
“…To recognize images accurately, researchers have proposed various architectures and techniques for CNNs, such as using multiple layers, 23 skip connections, 24 dense connections, 25 squeeze and excitation steps, 32 attention mechanisms, 33 and large kernel attention. 34 To remedy the limitations of the local inductive bias in modeling the global representations, transformer-based networks (e.g., CvT-13, 28 Swin Transformer, 31 ViT-B/16, 29 PVT, 30 PoolFormer-S12, 35 and BEiT-B 36 ) are proposed to model the long-range dependencies in feature space via a self-attention mechanism. However, the aforementioned networks are prone to overfitting when trained from scratch on few-shot samples.…”
Section: Image Recognition Techniques
confidence: 99%
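The statement above notes that transformer-based backbones model long-range dependencies via self-attention. As a point of contrast with the abstract's complexity argument, here is a minimal, illustrative PyTorch sketch of single-head self-attention over flattened image tokens (learned query/key/value projections omitted for brevity); the function name and shapes are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def self_attention_2d(x: torch.Tensor) -> torch.Tensor:
    """Plain single-head self-attention over flattened image tokens.

    Treating an H x W feature map as a 1D sequence of N = H*W tokens
    discards the 2D layout, and the N x N score matrix makes the cost
    quadratic in resolution -- the two issues the abstract raises.
    """
    b, c, h, w = x.shape
    tokens = x.flatten(2).transpose(1, 2)               # (B, N, C), N = H*W
    scores = tokens @ tokens.transpose(1, 2) / c ** 0.5  # (B, N, N) -- O(N^2)
    out = F.softmax(scores, dim=-1) @ tokens              # weighted mix of tokens
    return out.transpose(1, 2).reshape(b, c, h, w)

# Doubling the spatial resolution quadruples N and grows the score matrix
# sixteen-fold, which is why high-resolution inputs become expensive.
x = torch.randn(1, 64, 28, 28)
print(self_attention_2d(x).shape)  # torch.Size([1, 64, 28, 28])
```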
“…We used ViT-B/16-B/32 4 as our CLIP network architecture. We compared TEG to the SOTA image recognition methods in various few-shot settings, including CNN-based networks (VGG-11, 23 VGG-19, 23 ConvNeXt-T, 64 and VAN-B2 34 ), Transformer-based networks (ViT-B/16, 29 CvT-13, 28 Swin Transformer 31 (Swin-T), PoolFormer-S12, 35 BEiT-B, 36 and EfficientFormer-L1 65 ), as well as CLIP-based fine-tuning methods (zero-shot CLIP, 4 linear-probe CLIP, 4 CoOp, 5 and WiSE-FT (linear classifier, α = 0.5) 48 ). All the compared models were implemented using the PyTorch framework.…”
Section: Vegetable
confidence: 99%
“…We employ contemporary strategies that synergize with DenseNets as well. Our methodology eventually exceeds strong modern architectures [21,25,42,45,57,97] and some milestones like Swin Transformer [47], ConvNeXt [48], and DeiT-III [71] in performance trade-offs on ImageNet-1K [59]. Our models demonstrate competitive performance on downstream tasks such as ADE20K semantic segmentation and COCO object detection/instance segmentation.…”
Section: Introduction
confidence: 97%
“…The attention mechanism plays a significant role in various domains of machine learning, including Natural Language Processing (NLP) and Computer Vision (CV), 16–23 among others. Broadly speaking, attention can be regarded as a tool for directing available processing resources towards the most informative elements of an input signal. 24,25 …”
Section: Introduction
confidence: 99%