Reconstructing 3D outdoor scenes from a single RGB aerial image remains challenging, largely because of inadequate reconstruction accuracy. In this article, we introduce the Global Soft Pooling Adaptive Attention Network, a novel approach for high-precision 3D reconstruction of outdoor scenes. The network comprises two attention modules: Global Soft Pooling Dynamic Convolutional Attention and Three-Head Adaptive Graph Attention. The Global Soft Pooling Dynamic Convolutional Attention module extracts multi-scale image features with a stem_conv layer and four MBdyconv blocks; global soft pooling, combined with channel-fusion and spatial attention mechanisms, yields precise composite features, which are then used to initialize the mesh vertices. The Three-Head Adaptive Graph Attention module applies three sets of adaptive 1D convolutions, weighted by neighboring vertices, to extract 3D vertex features; a linear layer then predicts per-vertex coordinate offsets (denoted "ΔV"), which refine the initial mesh vertices into an improved mesh model. On the publicly available SensatUrban dataset, our method achieves reconstruction performance indices of 0.96 for l2 and 1.68 for l3. Experimental results demonstrate that our approach outperforms existing deep learning methods and achieves state-of-the-art performance in outdoor 3D scene reconstruction.
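The vertex-refinement step described above (a linear layer mapping per-vertex features to coordinate offsets ΔV, which are added to the initial vertices) can be sketched as follows. This is a minimal NumPy illustration under assumed shapes and function names; it is not the paper's implementation, and the features would in practice come from the Three-Head Adaptive Graph Attention module:

```python
import numpy as np

def linear(x, W, b):
    """A plain linear (fully connected) layer: x @ W + b."""
    return x @ W + b

def refine_vertices(V_init, vertex_features, W, b):
    """Refine initial mesh vertices with a predicted per-vertex offset dV.

    V_init:          (N, 3) initial vertex coordinates
    vertex_features: (N, F) per-vertex features (e.g. from graph attention)
    W, b:            linear-layer parameters mapping F -> 3
    Returns V_init + dV, the refined vertex coordinates.
    """
    dV = linear(vertex_features, W, b)  # (N, 3) coordinate offsets
    return V_init + dV

# Toy example with hypothetical sizes: 5 vertices, 8-dim features.
rng = np.random.default_rng(0)
N, F = 5, 8
V0 = rng.standard_normal((N, 3))        # initial mesh vertices
feats = rng.standard_normal((N, F))     # per-vertex features
W = rng.standard_normal((F, 3)) * 0.01  # small weights -> small offsets
b = np.zeros(3)
V1 = refine_vertices(V0, feats, W, b)   # refined vertices, shape (N, 3)
```

In the full network this refinement would be applied after the graph attention stage, with the linear layer's parameters learned jointly with the rest of the model.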