Cross-correlation is often used for feature fusion, especially in Siamese-based trackers. However, it struggles to capture complex nonlinear relationships and is susceptible to outliers in the samples. Recently, researchers have used Transformers for feature fusion and achieved stronger performance. However, most of these methods rely on modeling global token relationships, which can destroy the local and spatial correlations inherent in 2D structures. This paper proposes SiamCAT, an efficient tracking algorithm based on central attention and sliding window sampling. First, context augmentation with sliding windows is proposed to preserve the 2D spatial structure of the input. It uses attention to mimic the way convolution processes 2D data, and an internal memory composed of learnable parameters dynamically adjusts the attention layer. Second, to learn efficient feature fusion, this paper constructs a feature fusion network that combines template features and search features. Experiments show that SiamCAT achieves state-of-the-art results on the LaSOT, OTB100, NFS, UAV123, GOT10K, and TrackingNet benchmarks and runs in real time at 47 frames per second on a CPU. The code will be released at https://github.com/cnchange/SiamCAT.
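
To make the idea of attention restricted to a sliding window concrete, the following minimal sketch lets each spatial position of a 2D feature map attend only to its surrounding window, in the way a convolution kernel visits a neighborhood. It is an illustration under stated assumptions, not the SiamCAT implementation: the function name `local_window_attention`, the window size, the zero padding, and the absence of learned query/key/value projections and of the learnable internal memory are all simplifications introduced here.

```python
import numpy as np


def local_window_attention(feat, window=3):
    """Hypothetical sketch: each center position attends to its window x window neighborhood.

    feat: array of shape (H, W, C). Returns (out, weights) with shapes
    (H, W, C) and (H, W, window * window).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a center position")
    h, w, c = feat.shape
    pad = window // 2
    # Zero-pad the spatial borders so every position has a full window.
    # Padded neighbors take part in the softmax with a score of zero.
    padded = np.pad(feat, ((pad, pad), (pad, pad), (0, 0)))
    # Gather the neighbors of every position: shape (H, W, window * window, C).
    patches = np.stack(
        [padded[i:i + h, j:j + w] for i in range(window) for j in range(window)],
        axis=2,
    )
    # The center feature acts as the query; its neighbors act as keys and values.
    scores = np.einsum("hwc,hwkc->hwk", feat, patches) / np.sqrt(c)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = np.einsum("hwk,hwkc->hwc", weights, patches)
    return out, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    search_feat = rng.standard_normal((5, 6, 8))  # toy feature map, not real data
    fused, attn = local_window_attention(search_feat, window=3)
    print(fused.shape, attn.shape)
```

Because every output position depends only on a fixed local neighborhood, the 2D layout of the feature map is kept intact, in contrast to global token attention, which flattens the map into a sequence.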