Siamese network based trackers formulate tracking as convolutional feature cross-correlation between a target template and a search region. However, Siamese trackers still have an accuracy gap compared with state-of-the-art algorithms, and they cannot take advantage of features from deep networks, such as ResNet-50 or deeper. In this work we prove that the core reason is the lack of strict translation invariance. Through comprehensive theoretical analysis and experimental validation, we break this restriction with a simple yet effective spatially aware sampling strategy and successfully train a ResNet-driven Siamese tracker with a significant performance gain. Moreover, we propose a new model architecture to perform layer-wise and depth-wise aggregations, which not only further improves the accuracy but also reduces the model size. We conduct extensive ablation studies to demonstrate the effectiveness of the proposed tracker, which obtains the best current results on five large tracking benchmarks, including OTB2015, VOT2018, UAV123, LaSOT, and TrackingNet. Our model will be released to facilitate further research.

* The first three authors contributed equally. Work done at SenseTime. Project page: https://lb1100.github.io/SiamRPN++.

Recently, Siamese network based trackers [40,1,15,42,41,24,43,52,44] have drawn much attention in the community. These Siamese trackers formulate the visual object tracking problem as learning a general similarity map by cross-correlation between the feature representations learned for the target template and the search region. To ensure tracking efficiency, the offline-learned Siamese similarity function is often kept fixed at run time [40,1,15]. The CFNet tracker [41] and the DSiam tracker [11] update the tracking model via a running-average template and a fast transformation module, respectively. The SiamRPN tracker [24] introduces the region proposal network after the Siamese network and performs joint classification and regression for tracking. The DaSiamRPN tracker [52] further introduces a distractor-aware module and improves the discriminative power of the model.
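To make the cross-correlation formulation concrete, the similarity map can be computed by treating the template's feature map as a convolution kernel slid over the search region's feature map. The following is a minimal sketch, assuming PyTorch; the function name, feature shapes, and random inputs are illustrative placeholders rather than the authors' implementation. Note that PyTorch's `conv2d` computes cross-correlation (it does not flip the kernel), which matches the Siamese formulation directly.

```python
# Minimal sketch of the Siamese cross-correlation step.
# Shapes and names are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def cross_correlation(template_feat: torch.Tensor,
                      search_feat: torch.Tensor) -> torch.Tensor:
    """Slide the template feature map over the search feature map.

    template_feat: (C, Hz, Wz) features of the target template.
    search_feat:   (C, Hx, Wx) features of the (larger) search region.
    Returns a (1, Hx-Hz+1, Wx-Wz+1) similarity map whose peak indicates
    the most likely target location.
    """
    # conv2d expects (N, C, H, W) input and (out_ch, C, kH, kW) weights,
    # so the template features act as a single convolution kernel.
    x = search_feat.unsqueeze(0)      # (1, C, Hx, Wx)
    z = template_feat.unsqueeze(0)    # (1, C, Hz, Wz)
    return F.conv2d(x, z).squeeze(0)  # (1, Hx-Hz+1, Wx-Wz+1)

# Usage with random features standing in for a shared backbone's output:
z_feat = torch.randn(256, 6, 6)                # template features
x_feat = torch.randn(256, 22, 22)              # search-region features
score_map = cross_correlation(z_feat, x_feat)  # shape (1, 17, 17)
```

In a real tracker, both feature maps would come from the same offline-trained backbone applied to the template and search crops, and the peak of `score_map` would be mapped back to image coordinates.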