Cross-view geo-localization of unmanned aerial vehicles (UAVs) is a challenging task because of positional discrepancies and uncertainties in scale and distance between UAV and satellite views. Existing transformer-based geo-localization methods mainly use encoders to mine contextual information from images, but they are limited in handling scale changes across views. We therefore present GeoFormer, an effective transformer-based Siamese network tailored for UAV geo-localization. First, we design an efficient transformer feature extraction network that uses linear attention to reduce computational complexity and improve efficiency; within it, an efficient separable perceptron module built on depth-wise separable convolution further reduces computational cost while strengthening the network's feature representation. Second, we propose a multi-scale feature aggregation module (MFAM) that deeply fuses salient features at different scales through a feed-forward network to produce semantically rich global representations, improving the model's ability to capture image details and learn robust features. Additionally, we design a semantic-guided region segmentation module (SRSM) that applies k-modes clustering to partition the feature map into semantically consistent regions and performs feature recognition within each region, improving image-matching accuracy. Finally, we design a hierarchical reinforcement rotation matching strategy: starting from the retrieval results of UAV-view queries against satellite images, SuperPoint keypoint extraction and LightGlue rotation matching are applied to achieve accurate UAV geo-localization (see the sketch below). Experimental results show that the proposed method achieves effective UAV geo-localization.
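To make the final matching step concrete, the following Python sketch shows one way to combine SuperPoint keypoint extraction with LightGlue matching on a retrieved UAV/satellite image pair, following the public cvg/LightGlue interface. The image file names, the keypoint budget, and the omission of the hierarchical rotation search are illustrative assumptions, not the paper's exact pipeline.

```python
import torch
from lightglue import LightGlue, SuperPoint
from lightglue.utils import load_image, rbd

# Keypoint extractor and matcher (cvg/LightGlue reference interface);
# max_num_keypoints is an assumed budget, not the paper's setting.
extractor = SuperPoint(max_num_keypoints=2048).eval()
matcher = LightGlue(features="superpoint").eval()

# Hypothetical file names: a UAV-view query and a retrieved satellite candidate.
image0 = load_image("uav_query.jpg")
image1 = load_image("satellite_candidate.jpg")

with torch.no_grad():
    # Extract keypoints and descriptors from both views.
    feats0 = extractor.extract(image0)
    feats1 = extractor.extract(image1)
    # Match the two sets of local features.
    matches01 = matcher({"image0": feats0, "image1": feats1})

# Remove the batch dimension and gather the matched keypoint coordinates.
feats0, feats1, matches01 = [rbd(x) for x in (feats0, feats1, matches01)]
matches = matches01["matches"]                # (K, 2) index pairs
points0 = feats0["keypoints"][matches[..., 0]]  # matched points in the UAV view
points1 = feats1["keypoints"][matches[..., 1]]  # matched points in the satellite view
```

In a full pipeline, such matched point pairs would feed the rotation-aware alignment that produces the final UAV position estimate.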