2022
DOI: 10.1007/978-3-031-20047-2_6

CMT: Context-Matching-Guided Transformer for 3D Tracking in Point Clouds

Cited by 14 publications (13 citation statements)
References 40 publications
“…We present a comprehensive comparison of our method with the previous state-of-the-art approaches, namely SC3D [8], P2B [23], 3DSiamRPN [6], LTTR [5], MLVS-Net [30], BAT [34], PTT [24], V2B [10], CMT [9], PTTR [36], STNet [11], TAT [16], M2-Track [35] and CX-Track [31] on the KITTI dataset. The published results from corresponding papers are reported.…”
Section: Results
confidence: 99%
“…V2B [10] proposes to transform point features into a dense bird's eye view feature map to tackle the sparsity of point clouds. LTTR [5], PTTR [36], CMT [9] and STNet [11] introduce various attention mechanisms into the 3D SOT task for better target-specific feature propagation. PTTR [36] also proposes a light-weight Prediction Refinement Module for coarse-to-fine localization.…”
Section: Related Work
confidence: 99%
“…V2B (Hui et al 2021) performs Voxel-to-BEV transformation for object localization on the densified feature maps. Inspired by the success of Transformer (Vaswani et al 2017) on computer vision tasks (Liu et al 2021;Carion et al 2020a), several studies (Zhou et al 2022;Cui et al 2021;Shan et al 2021;Hui et al 2022;Guo et al 2022;Nie et al 2023;Xu et al 2023) incorporate Transformer for enhanced feature extraction and correlation modeling and achieve improved accuracy.…”
Section: Related Work
confidence: 99%
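The voxel-to-BEV idea the excerpts above attribute to V2B (collapsing sparse point features into a dense bird's-eye-view map) can be sketched generically. This is a minimal NumPy illustration, not V2B's actual implementation; the function name, grid size, and extent are hypothetical, and the z-axis is simply collapsed by max-pooling.

```python
import numpy as np

def points_to_bev(points, feats, grid=(32, 32), extent=4.0):
    """Max-pool per-point features into a dense bird's-eye-view grid.

    points: (N, 3) xyz coordinates, assumed within [-extent/2, extent/2]
    feats:  (N, C) per-point features
    Returns a (grid_x, grid_y, C) dense BEV feature map.
    """
    gx, gy = grid
    bev = np.zeros((gx, gy, feats.shape[1]), dtype=feats.dtype)
    # Map x/y coordinates to cell indices; z is dropped (the "to-BEV" step).
    ix = np.clip(((points[:, 0] / extent + 0.5) * gx).astype(int), 0, gx - 1)
    iy = np.clip(((points[:, 1] / extent + 0.5) * gy).astype(int), 0, gy - 1)
    for i in range(len(points)):
        # Scatter-max: each cell keeps the elementwise max of its points.
        bev[ix[i], iy[i]] = np.maximum(bev[ix[i], iy[i]], feats[i])
    return bev
```

The resulting dense map can then be processed with ordinary 2D convolutions for localization, which is the motivation the quoted papers give for densifying sparse point features.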
“…V2B [33] designs a voxel-to-BEV object localization network to tackle sparse point clouds. Other techniques such as LTTR [34], PTT [35], PTTR [36], STNet [37], and CMT [38] develop sophisticated transformer structures to improve feature fusion or object localization. Nevertheless, none of them challenges…”
Section: A 3D Siamese Tracking
confidence: 99%
“…As mentioned above, several trackers [34]–[38] based on the transformer architecture have been introduced for 3D SOT on point clouds. These methods typically employ self-attention to refine features or cross-attention to facilitate interaction between the features extracted from the template and search regions.…”
Section: B Vision Transformer
confidence: 99%
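The cross-attention pattern described in the last excerpt (search-region queries attending to template keys/values for target-specific feature propagation) can be sketched as follows. This is an illustrative single-head NumPy version under assumed shapes, not the architecture of any specific cited tracker.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(search_feats, template_feats):
    """Propagate template information into search-region features.

    search_feats:   (Ns, d) queries from the search region
    template_feats: (Nt, d) keys and values from the template
    Returns (Ns, d) target-specific search features.
    """
    d = search_feats.shape[1]
    q, k, v = search_feats, template_feats, template_feats
    # Scaled dot-product attention: each search point mixes template features.
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (Ns, Nt), rows sum to 1
    return attn @ v
```

Self-attention, the other mechanism the excerpt mentions, is the degenerate case where queries, keys, and values all come from the same feature set.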