Propagation and association tasks in Multi-Object Tracking (MOT) play a pivotal role in accurately linking the trajectories of moving objects. Recent deep learning models address these tasks with fragmented solutions, one for each sub-problem, such as appearance modeling, motion modeling, and object association. To unify the MOT task, we introduce a pixel-guided approach that efficiently builds a joint detection and tracking framework for multi-object tracking. Specifically, the up-sampled multi-scale features from consecutive frames are queued to detect object locations with a transformer decoder, and per-pixel distributions are used to compute the association matrix according to the object queries. Additionally, we introduce a long-term appearance association on track features, which learns the long-term association of tracks against detections and is used to compute the similarity matrix. Finally, this similarity matrix is integrated with the Byte-Tracker, resulting in state-of-the-art MOT performance. Experiments on the standard MOT15 and MOT17 benchmarks show that our approach achieves strong tracking performance.
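To make the association step concrete, the following is a minimal sketch of how a track-versus-detection similarity matrix might be fused with box overlap and turned into matches. It is not the method described above: the cosine similarity on feature vectors, the weight `alpha`, the threshold, the dictionary layout, and the greedy matching are all illustrative assumptions, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fused_similarity(tracks, detections, alpha=0.5):
    """Rows are tracks, columns are detections.

    Assumption: appearance and overlap are combined with a fixed weight alpha.
    """
    return [
        [
            alpha * cosine_similarity(t["feat"], d["feat"])
            + (1.0 - alpha) * iou(t["box"], d["box"])
            for d in detections
        ]
        for t in tracks
    ]


def greedy_match(sim, threshold=0.3):
    """Pick the highest-scoring pairs first; each track and detection is used once.

    Assumption: greedy matching stands in for an optimal assignment solver.
    """
    candidates = sorted(
        ((sim[i][j], i, j) for i in range(len(sim)) for j in range(len(sim[i]))),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining candidates are all below the threshold
        if i in used_tracks or j in used_dets:
            continue
        used_tracks.add(i)
        used_dets.add(j)
        matches.append((i, j))
    return sorted(matches)


# Toy data: two tracks and two detections listed in swapped order.
tracks = [
    {"box": (0, 0, 10, 10), "feat": [1.0, 0.0]},
    {"box": (20, 20, 30, 30), "feat": [0.0, 1.0]},
]
detections = [
    {"box": (21, 21, 31, 31), "feat": [0.1, 0.9]},
    {"box": (1, 1, 11, 11), "feat": [0.9, 0.1]},
]

matches = greedy_match(fused_similarity(tracks, detections))
print(matches)  # (track index, detection index) pairs
```

In this toy case each track pairs with the detection that both overlaps it and has a similar feature vector. A tracker in the ByteTrack style would typically run such an association in stages over high-confidence and low-confidence detections, and an optimal assignment solver could replace the greedy loop; both are left out here for brevity.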