2021 · Preprint
DOI: 10.48550/arxiv.2112.00995

SwinTrack: A Simple and Strong Baseline for Transformer Tracking

Cited by 26 publications (61 citation statements)
References 0 publications
“…For example, Liang et al presented SwinIR [20] for image restoration, which first proposed the convolutional layer to extract shallow features, and then adopted Swin transformer for deep feature extraction. Lin et al introduced SwinTrack [21] to interact with the target object and search region for tracking. However, few studies have developed transformer into image fusion fields.…”
Section: A Transformer In Vision Tasks
confidence: 99%
“…Following the previous methods [51,12], we train the models on the train splits of four datasets GOT10k [36], TrackingNet [59], LaSOT [20], and COCO [52] and report the success score (SUC) for the TrackingNet dataset and LaSOT dataset, and the average overlap (AO) for GOT10k. We use the SwinTrack [51] to train and evaluate our pre-trained models with the same data augmentations, training, and inference settings. We sample 131072 pairs per epoch and train the models for 300 epochs.…”
Section: Geometric and Motion Tasks
confidence: 99%
“…For the video object tracking, MIM models also show a stronger transfer ability over supervised pretrained models. On the long-term dataset LaSOT, SwinTrack [51] with MIM pre-trained SwinV2-B backbone achieves comparable result with the SOTA MixFormer-L [12] with a larger image size 320 × 320. We obtain the best SUC of 70.7 on the LaSOT with SwinV2-L backbone with the input image size 224 × 224 and template size 112 × 112.…”
Section: Geometric and Motion Tasks
confidence: 99%