2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021
DOI: 10.1109/iccv48922.2021.01028

Learning Spatio-Temporal Transformer for Visual Tracking

Cited by 566 publications (398 citation statements). References 42 publications.
“…Thus, it becomes the winner of VOT-RT2021. In the real-time track, TransT-M performs 1.9% higher than the second-best tracker, STARK [28], which is also a transformer-based tracker. The VOT sequences are difficult: they contain many appearance changes and similar-target interference.…”
Section: Evaluation on VOT
confidence: 99%
“…At the same time, [27] also employed a Transformer, combining it with SiameseRPN [4] and DiMP [12] as a feature-enhancement module to improve tracker performance rather than to replace the correlation. STARK [28] proposes another transformer tracking framework that concatenates the search region and the template. It also employs a corner prediction head to improve the accuracy of bounding-box prediction, and a dynamic template to fuse temporal information.…”
Section: Related Work
confidence: 99%
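The concatenation described above can be sketched roughly as follows (a minimal illustration, not the authors' code; the feature-map sizes and token dimension are assumptions):

```python
import numpy as np

# Assumed feature maps: an 8x8 template patch and a 16x16 search region,
# each flattened into tokens of dimension d, as in a STARK-style pipeline.
d = 256
template_tokens = np.random.randn(8 * 8, d)   # 64 template tokens
search_tokens = np.random.randn(16 * 16, d)   # 256 search-region tokens

# Concatenate along the token axis so that self-attention over the joint
# sequence can relate template and search-region features to each other.
tokens = np.concatenate([template_tokens, search_tokens], axis=0)
print(tokens.shape)  # → (320, 256)
```

Because the two sets of tokens share one sequence, no explicit correlation operation is needed; attention performs the matching.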
“…Most work in tracking focuses on discriminative tracking [82], [111], [112], [113], employing a Transformer to spatially relate the tracked object to its surroundings, effectively leveraging global attention to discriminate between the tracked object and the background. Since the Transformer relies on an accurate representation, the template feature used for discrimination is progressively updated with a moving average [111], [113]. Alternatively, the Transformer can be used to attend to objects that interact with the tracked object, and to use that information to infer tracking and predict the movements of occluded actors and/or objects [63].…”
Section: S23 Tracking
confidence: 99%
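The moving-average template update mentioned above can be sketched as an exponential moving average of the template feature (the update rate `alpha` and function name are assumptions for illustration):

```python
import numpy as np

def update_template(z_old, z_cur, alpha=0.1):
    """Progressively update the template feature with a moving average:
    keep most of the old template, blend in a fraction of the current one."""
    return (1.0 - alpha) * z_old + alpha * z_cur

# Toy features: old template of ones, current-frame feature of zeros.
z_old = np.ones(4)
z_cur = np.zeros(4)
z_new = update_template(z_old, z_cur, alpha=0.1)
print(z_new)  # → [0.9 0.9 0.9 0.9]
```

A small `alpha` makes the template drift slowly, trading adaptation speed for robustness to transient appearance changes.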
“…Recently, the transformer [33] has been successfully applied to many vision tasks [9,22,3]. In the tracking field, the transformer also boosts performance [4,35,40]. However, the transformer involves heavily sequential computation, and its computational cost grows with the square of the number of input tokens.…”
Section: Introduction
confidence: 99%
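The quadratic cost noted above follows from self-attention forming an n × n score matrix over n input tokens; a trivial sketch (token counts are illustrative):

```python
def attention_scores(n_tokens):
    """Number of pairwise attention scores for n input tokens:
    every token attends to every token, giving an n x n matrix."""
    return n_tokens * n_tokens

print(attention_scores(320))  # → 102400
print(attention_scores(640))  # → 409600, i.e. 4x the cost for 2x the tokens
```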