2020
DOI: 10.48550/arxiv.2012.15460
Preprint

TransTrack: Multiple Object Tracking with Transformer

Abstract: Multiple-object tracking (MOT) is mostly dominated by complex, multi-step tracking-by-detection algorithms, which perform object detection, feature extraction, and temporal association separately. The query-key mechanism in single-object tracking (SOT), which tracks the object in the current frame by the object feature of the previous frame, has great potential to set up a simple joint-detection-and-tracking MOT paradigm. Nonetheless, the query-key method is seldom studied due to its inability to detect new-coming objects…
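
As a rough illustration of the query-key idea the abstract describes, here is a minimal PyTorch-style sketch of a joint-detection-and-tracking step, assuming a DETR-like decoder: object features decoded from the previous frame act as track queries on the current frame, while a set of learned detect queries handles new-coming objects. This is not the authors' implementation; the single shared decoder and all names and sizes are illustrative assumptions.

```python
# A minimal sketch of query-key joint detection and tracking -- not the
# authors' code; module names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class QueryKeyTracker(nn.Module):
    def __init__(self, dim=256, num_detect_queries=100):
        super().__init__()
        # Learned "detect queries" find objects, including new-coming ones.
        self.detect_queries = nn.Parameter(torch.randn(num_detect_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.box_head = nn.Linear(dim, 4)  # (cx, cy, w, h), normalized

    def forward(self, frame_feats, prev_track_queries=None):
        # frame_feats: (B, HW, dim) backbone features of the current frame.
        B = frame_feats.size(0)
        detect_q = self.detect_queries.unsqueeze(0).expand(B, -1, -1)
        if prev_track_queries is None:
            queries = detect_q                      # first frame: detect only
        else:
            # Previous-frame object features act as track queries, so the
            # boxes decoded from them follow the same identities.
            queries = torch.cat([prev_track_queries, detect_q], dim=1)
        decoded = self.decoder(queries, frame_feats)
        boxes = self.box_head(decoded).sigmoid()
        # Decoded features of kept objects become the next frame's track queries.
        return boxes, decoded
```

In the paper itself the track and detect branches produce two box sets that are merged by IoU matching; the sketch collapses them into one decoder for brevity.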

Cited by 109 publications (201 citation statements)
References 60 publications

“…The elegance of ViT [23] has also motivated similar model designs with simpler global operators such as MLP-Mixer [85], gMLP [53], GFNet [74], and FNet [43], to name a few. Despite successful applications to many high-level tasks [4,23,56,83,87,100], the efficacy of these global models on low-level enhancement and restoration problems has not been studied extensively. The pioneering works on Transformers for low-level vision [9,14] directly applied full self-attention, which only accepts relatively small patches of fixed sizes (e.g., 48×48).…”
Section: Enhancement (mentioning)
confidence: 99%
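A back-of-envelope calculation (my own, not from the quoted papers) makes that patch-size constraint concrete: a full self-attention map has one entry per token pair, so its size grows quadratically with the number of tokens.

```python
# Back-of-envelope arithmetic (illustrative, not from the cited papers):
# a full self-attention map has tokens**2 entries per head.
def attention_entries(height: int, width: int) -> int:
    tokens = height * width        # one token per spatial position
    return tokens ** 2

print(attention_entries(48, 48))    # 5,308,416 (~5.3M entries)
print(attention_entries(480, 480))  # ~5.3e10 entries, 10,000x larger
```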
“…Track-RCNN [22] and FairMOT [23] further add a Re-ID branch on top of the object detector in a joint training framework, incorporating object detection and Re-ID feature learning. Based on DETR, TransTrack [9] and TrackFormer [24] develop transformer-based frameworks for MOT.…”
Section: Related Work (mentioning)
confidence: 99%
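As a sketch of the joint detection-and-Re-ID pattern mentioned above (my illustration, not Track-RCNN or FairMOT code): a shared backbone feeds both a detection head and an identity-embedding head, so the two tasks are trained together. All layer names and sizes here are assumptions.

```python
# Minimal sketch of a joint detection + Re-ID network -- illustrative only,
# not Track-RCNN or FairMOT; all layers and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDetReID(nn.Module):
    def __init__(self, feat_dim=256, num_classes=1, embed_dim=128):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3)
        self.det_head = nn.Conv2d(feat_dim, num_classes + 4, kernel_size=1)  # scores + box
        self.reid_head = nn.Conv2d(feat_dim, embed_dim, kernel_size=1)       # ID embedding

    def forward(self, images):
        feats = torch.relu(self.backbone(images))
        detections = self.det_head(feats)
        # L2-normalized per-location embeddings; association compares them
        # across frames by cosine similarity.
        embeddings = F.normalize(self.reid_head(feats), dim=1)
        return detections, embeddings
```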
“…Building on transformer-based methods [9] and accounting for the heterogeneity of different modalities, our CMC2R is a fully end-to-end framework that fuses information collaboratively using a two-stream structure and a transformer structure, with detection and tracking trained jointly. Second, NMS is not needed for track association, and a temporal passing module combined with multi-frame tracking features is proposed to model the temporal relation.…”
Section: Related Work (mentioning)
confidence: 99%