2022 IEEE International Conference on Multimedia and Expo (ICME) 2022
DOI: 10.1109/icme52920.2022.9859907
SimViT: Exploring a Simple Vision Transformer with Sliding Windows

Cited by 13 publications (7 citation statements) · References 12 publications
“…However, the reduced size of the feature space results in a loss of information, so the researchers used Pixel-Shuffle to upsample the attention output and preserve feature integrity. SimViT [15] is a simplified version of ViT [38] and one of the inspirations for this work. It uses a sliding window to sample the input image, constructing a convolution-like operation that better captures spatial structure without introducing a positional encoding that requires trainable parameters.…”
Section: Efficient Transformers
confidence: 99%
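The convolution-like sliding-window attention described in this statement can be illustrated with a minimal numpy sketch: each token attends only to its k × k spatial neighbourhood, so locality comes from the window itself rather than from a trainable positional encoding. This is an illustrative assumption-laden sketch of the idea, not the SimViT implementation; the function name and shapes are hypothetical.

```python
import numpy as np

def local_window_attention(x, k=3):
    """Sliding-window (convolution-like) attention over a (H, W, C) feature map.
    Each centre token is its own query and attends to its k x k neighbourhood.
    A minimal numpy sketch of the idea, not the SimViT code."""
    H, W, C = x.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # zero-pad the borders
    # Every k x k spatial neighbourhood: shape (H, W, C, k, k)
    windows = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(0, 1))
    windows = windows.transpose(0, 1, 3, 4, 2).reshape(H, W, k * k, C)
    q = x.reshape(H, W, 1, C)                    # centre token as query
    scores = (q * windows).sum(-1) / np.sqrt(C)  # scaled dot-product scores
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)          # softmax over the window
    return (attn[..., None] * windows).sum(2)    # (H, W, C) attended output
```

Because every query sees only k² keys, cost grows linearly with the number of tokens instead of quadratically, which is the efficiency argument the quoted passage makes.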
“…However, unlike convolution, traditional attention is non-local: the model can attend to information far from the current position, disrupting the spatial relationships of the input [15] and destroying information such as the position, shape, and relative arrangement of objects in the image, which degrades downstream performance. To this end, a flowchart of the Central-Context Augment proposed in this paper is shown in Figure 4.…”
Section: Central-Context Augment with Sliding Windows
confidence: 99%
“…Due to the limitations of computing speed, traditional object detection algorithms focus mainly on pixel information in images. Traditional object detection algorithms can be divided into two categories, sliding window-based methods [25] and region proposal-based methods [26]. The sliding window-based approach achieves object detection by sliding windows of different sizes over an image and classifying the contents within the different windows.…”
Section: Object Detection
confidence: 99%
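The sliding-window detection approach this statement describes can be sketched in a few lines: slide windows of several sizes over the image with a fixed stride and classify each crop. This is a generic illustration under stated assumptions, not any cited paper's method; `classify` is a hypothetical callable mapping a crop to a confidence score.

```python
import numpy as np

def sliding_window_detect(image, classify, sizes=(32, 64), stride=16):
    """Classic sliding-window detection: score every window of each size
    with a classifier and keep confident boxes as (x, y, w, h, score).
    `classify` is a hypothetical crop -> score callable."""
    H, W = image.shape[:2]
    detections = []
    for s in sizes:
        for y in range(0, H - s + 1, stride):
            for x in range(0, W - s + 1, stride):
                score = classify(image[y:y + s, x:x + s])
                if score > 0.5:  # keep windows the classifier is confident about
                    detections.append((x, y, s, s, score))
    return detections
```

The exhaustive scan is what makes the classic approach slow: the number of classifier calls grows with image area, window count, and inverse stride, which motivated the region-proposal methods [26] mentioned alongside it.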
“…Its main advantage is that it allows the model to consider all elements of the sequence simultaneously, rather than relying only on local or adjacent information, which helps the model build a global representation of the input image. ViT [16] and its successors [17][18][19][20][21] have demonstrated the potential to tackle vision tasks by processing image patches through Transformers, yet they often require extensive datasets and sophisticated training strategies to achieve competitive performance. Despite these advances, Transformers struggle with localized feature extraction and exhibit a quadratic increase in computational complexity at higher image resolutions, which can be impractical for certain applications.…”
Section: Introduction
confidence: 99%
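The quadratic-complexity claim in this last statement can be made concrete with a small back-of-the-envelope helper: for an image split into fixed-size patches, global self-attention computes one score per token pair, so doubling the resolution quadruples the token count and multiplies the score count by sixteen. The function below is a hypothetical illustration, assuming 16 × 16 patches as in ViT.

```python
def attention_cost(h, w, patch=16):
    """Number of pairwise attention scores for an h x w image split into
    patch x patch tokens: quadratic in the token count, hence quartic in
    linear resolution. A back-of-the-envelope sketch, not a FLOP count."""
    n = (h // patch) * (w // patch)  # number of tokens
    return n * n                     # one score per (query, key) pair
```

For a 224 × 224 input this gives 196 tokens and 196² scores; at 448 × 448 the score count grows 16×, which is the scaling problem local-window schemes like SimViT's are designed to avoid.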