SiamMan: Siamese Motion-aware Network for Visual Tracking

Zhou, Wenzhang; Wen, Longyin; Zhang, Libo; Du, Dawei; Luo, Tiejian; Wu, Yanjun

doi:10.48550/arxiv.1912.05515

Cited by 1 publication

(1 citation statement)

References 38 publications

“…In [2], Bertinetto et al. proposed SiamFC, a pioneering work that combines naive feature correlation with a fully-convolutional Siamese network for object tracking. Subsequently, some improvements [85,68,75,82,76] are made to Siamese trackers, such as combining with a region proposal network [17,39,80,65] or an anchor-free FCOS detector [11], using a deeper architecture [38] or two-branch structure [23], exploiting attention [67,84] or self-attention [7], applying triplet loss [14]. However, these methods are specially designed for 2D object tracking, so they cannot be directly applied to 3D point clouds.…”

Section: Related Work (mentioning)

confidence: 99%

3D Siamese Voxel-to-BEV Tracker for Sparse Point Clouds

Hui, Wang, Cheng et al. 2021

Preprint

3D object tracking in point clouds is still a challenging problem due to the sparsity of LiDAR points in dynamic environments. In this work, we propose a Siamese voxel-to-BEV tracker, which can significantly improve the tracking performance in sparse 3D point clouds. Specifically, it consists of a Siamese shape-aware feature learning network and a voxel-to-BEV target localization network. The Siamese shape-aware feature learning network can capture 3D shape information of the object to learn the discriminative features of the object so that the potential target from the background in sparse point clouds can be identified. To this end, we first perform template feature embedding to embed the template's feature into the potential target and then generate a dense 3D shape to characterize the shape information of the potential target. For localizing the tracked target, the voxel-to-BEV target localization network regresses the target's 2D center and the z-axis center from the dense bird's eye view (BEV) feature map in an anchor-free manner. Concretely, we compress the voxelized point cloud along z-axis through max pooling to obtain a dense BEV feature map, where the regression of the 2D center and the z-axis center can be performed more effectively. Extensive evaluation on the KITTI and nuScenes datasets shows that our method significantly outperforms the current state-of-the-art methods by a large margin. Code is available at https://github.com/fpthink/V2B.
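The z-axis compression described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: binary occupancy stands in for the learned per-voxel features, and the names `voxelize` and `voxel_to_bev`, the voxel size, and the grid shape are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def voxelize(points: Sequence[Point], voxel_size: float,
             grid_shape: Tuple[int, int, int]) -> List[List[List[float]]]:
    """Build an occupancy grid: 1.0 where a voxel holds a point, else 0.0.

    Assumption: occupancy replaces the learned voxel features of the paper.
    Points that fall outside the grid are ignored.
    """
    nx, ny, nz = grid_shape
    grid = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
    for x, y, z in points:
        i = int(x // voxel_size)
        j = int(y // voxel_size)
        k = int(z // voxel_size)
        if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
            grid[i][j][k] = 1.0
    return grid


def voxel_to_bev(grid: List[List[List[float]]]) -> List[List[float]]:
    """Collapse the z-axis with max pooling, giving an (nx, ny) BEV map."""
    return [[max(column) for column in row] for row in grid]


# Three toy points; the first two share an (x, y) cell at different heights.
points = [(0.2, 0.1, 0.9), (0.3, 0.1, 1.6), (1.7, 1.2, 0.4)]
grid = voxelize(points, voxel_size=1.0, grid_shape=(2, 2, 2))
bev = voxel_to_bev(grid)
```

Because max pooling keeps the strongest response in each vertical column, a BEV cell is non-empty whenever any voxel above it is occupied, which is one way to read the abstract's claim that the resulting BEV map is denser than the sparse 3D input.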
