2022 IEEE International Conference on Multimedia and Expo (ICME) 2022
DOI: 10.1109/icme52920.2022.9859907
SimViT: Exploring a Simple Vision Transformer with Sliding Windows

Cited by 13 publications (7 citation statements) · References 12 publications
“…However, the reduced size of the feature space results in a loss of information, so the researchers used Pixel-Shuffle to upsample the attention output and preserve feature integrity. SimViT [15] is a simplified version of ViT [38] and one of the inspirations for this work. It uses a sliding window to sample the input image, constructing a convolution-like operation that better captures spatial structure without introducing a positional encoding that requires trainable parameters.…”
Section: Efficient Transformers
confidence: 99%
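The convolution-like sliding-window attention described in this statement can be illustrated with a minimal numpy sketch: each token attends only to its k × k spatial neighbourhood, so locality comes from the window itself rather than from a trainable positional encoding. This is an illustrative assumption-laden sketch of the idea, not the SimViT implementation; the function name and shapes are hypothetical.

```python
import numpy as np

def local_window_attention(x, k=3):
    """Sliding-window (convolution-like) attention over a (H, W, C) feature map.
    Each centre token is its own query and attends to its k x k neighbourhood.
    A minimal numpy sketch of the idea, not the SimViT code."""
    H, W, C = x.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # zero-pad the borders
    # Every k x k spatial neighbourhood: shape (H, W, C, k, k)
    windows = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(0, 1))
    windows = windows.transpose(0, 1, 3, 4, 2).reshape(H, W, k * k, C)
    q = x.reshape(H, W, 1, C)                    # centre token as query
    scores = (q * windows).sum(-1) / np.sqrt(C)  # scaled dot-product scores
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)          # softmax over the window
    return (attn[..., None] * windows).sum(2)    # (H, W, C) attended output
```

Because every query sees only k² keys, cost grows linearly with the number of tokens instead of quadratically, which is the efficiency argument the quoted passage makes.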
“…However, unlike convolution, traditional attention is non-local: the model can attend to information far from the current position, disrupting the spatial relationships of the input [15] and destroying information such as the position, shape, and relative arrangement of objects in the image, which degrades downstream performance. To this end, a flowchart of the Central-Context Augment proposed in this paper is shown in Figure 4.…”
Section: Central-Context Augment with Sliding Windows
confidence: 99%
“…Due to the limitations of computing speed, traditional object detection algorithms focus mainly on pixel information in images. Traditional object detection algorithms can be divided into two categories, sliding window-based methods [25] and region proposal-based methods [26]. The sliding window-based approach achieves object detection by sliding windows of different sizes over an image and classifying the contents within the different windows.…”
Section: Object Detection
confidence: 99%
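The sliding-window detection approach this statement describes can be sketched in a few lines: slide windows of several sizes over the image with a fixed stride and classify each crop. This is a generic illustration under stated assumptions, not any cited paper's method; `classify` is a hypothetical callable mapping a crop to a confidence score.

```python
import numpy as np

def sliding_window_detect(image, classify, sizes=(32, 64), stride=16):
    """Classic sliding-window detection: score every window of each size
    with a classifier and keep confident boxes as (x, y, w, h, score).
    `classify` is a hypothetical crop -> score callable."""
    H, W = image.shape[:2]
    detections = []
    for s in sizes:
        for y in range(0, H - s + 1, stride):
            for x in range(0, W - s + 1, stride):
                score = classify(image[y:y + s, x:x + s])
                if score > 0.5:  # keep windows the classifier is confident about
                    detections.append((x, y, s, s, score))
    return detections
```

The exhaustive scan is what makes the classic approach slow: the number of classifier calls grows with image area, window count, and inverse stride, which motivated the region-proposal methods [26] mentioned alongside it.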
“…Its main advantage is that it allows the model to consider all elements of the sequence simultaneously, rather than relying only on local or adjacent information, which helps the model build a global representation of the input image. ViT [16] and its successors [17][18][19][20][21] have demonstrated the potential to tackle vision tasks by processing image patches through Transformers, yet they often require extensive datasets and sophisticated training strategies to achieve competitive performance. Despite these advances, Transformers struggle with localized feature extraction and exhibit a quadratic increase in computational complexity at higher image resolutions, which can be impractical for certain applications.…”
Section: Introduction
confidence: 99%
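The quadratic-complexity claim in this last statement can be made concrete with a small back-of-the-envelope helper: for an image split into fixed-size patches, global self-attention computes one score per token pair, so doubling the resolution quadruples the token count and multiplies the score count by sixteen. The function below is a hypothetical illustration, assuming 16 × 16 patches as in ViT.

```python
def attention_cost(h, w, patch=16):
    """Number of pairwise attention scores for an h x w image split into
    patch x patch tokens: quadratic in the token count, hence quartic in
    linear resolution. A back-of-the-envelope sketch, not a FLOP count."""
    n = (h // patch) * (w // patch)  # number of tokens
    return n * n                     # one score per (query, key) pair
```

For a 224 × 224 input this gives 196 tokens and 196² scores; at 448 × 448 the score count grows 16×, which is the scaling problem local-window schemes like SimViT's are designed to avoid.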