2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00857
Global Tracking Transformers

Cited by 93 publications (40 citation statements)
References 40 publications
“…4.1. Note that while AOA [14] has a better ClsA, it ensembles multiple few-shot detection and re-identification models trained on additional datasets, as reported by previous works [36,81]. Overall, our approach surpasses the previous state-of-the-art by 1.4 points in TETA and 2.3 points in Track mAP while using a weaker backbone and the same detector.…”
Section: Comparison to State-of-the-Art
confidence: 61%
“…For our ablation studies, we use the same 6-epoch fine-tuning as above. For data hallucination, we use the combined LVISv1 and COCO annotations as used in [10,18,81]. Note that for data hallucination, we only add objects with a bounding box area greater than 64² to A⁺.…”
Section: Experiment Details
confidence: 99%
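The area threshold described in the statement above can be sketched as follows. This is a hypothetical illustration, not the cited authors' code; the `(x1, y1, x2, y2)` box format and the function names are assumptions.

```python
# Sketch of the data-hallucination filter: only objects whose bounding box
# area exceeds 64^2 pixels are added to the augmentation set A+.
MIN_AREA = 64 ** 2  # 4096 px^2, the threshold quoted in the citing paper

def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def filter_hallucination_candidates(boxes):
    """Keep only boxes large enough to be used for data hallucination."""
    return [b for b in boxes if box_area(b) > MIN_AREA]
```

For example, a 100×100 box passes the filter while a 10×10 box is dropped.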
See 2 more Smart Citations
“…It is capable of linking objects after a long time span, which is realized by storing the identity embeddings of the tracked objects in a large spatiotemporal memory, and by adaptively referencing and aggregating useful information from the memory as needed. Global Tracking Transformers (GTR) (Zhou et al., 2022) is a global MOT network structure based on transformers, which uses them to encode all target features in the input video sequence and assigns the targets to different trajectories using trajectory queries.…”
Section: Vision Transformer-based MOT
confidence: 99%
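The global association idea described in the statement above can be sketched in a few lines: trajectory queries score all per-frame detection features, and each query claims its highest-scoring detection in every frame. This is a minimal NumPy illustration under assumed shapes, not GTR's actual transformer implementation (which learns the queries and uses attention rather than a raw dot product).

```python
import numpy as np

# Assumed toy setup: 3 trajectory queries, 5 frames, 4 detections per frame.
rng = np.random.default_rng(0)
num_queries, feat_dim, num_frames, dets_per_frame = 3, 8, 5, 4

queries = rng.standard_normal((num_queries, feat_dim))
frames = [rng.standard_normal((dets_per_frame, feat_dim)) for _ in range(num_frames)]

# Each trajectory is the sequence of (frame, detection) pairs its query selects.
tracks = {q: [] for q in range(num_queries)}
for t, dets in enumerate(frames):
    scores = queries @ dets.T          # (num_queries, num_dets) similarity
    assign = scores.argmax(axis=1)     # best detection per trajectory query
    for q, d in enumerate(assign):
        tracks[q].append((t, int(d)))
```

Because every frame of the clip is scored jointly against the same set of queries, association is global over the sequence rather than frame-to-frame, which is the property the citing paper highlights.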