By integrating visual and natural language understanding, Visual Question Answering (VQA) holds promise for enhancing the intelligence of computer systems and thereby improving user work efficiency. However, current VQA research faces two major challenges: low precision in feature pre-extraction and low efficiency in feature fusion. This study proposes RSFE-TBT, a pre-trained framework for question-answering tasks that improves feature pre-extraction accuracy and feature fusion efficiency across modalities. To address the low pre-extraction accuracy of existing methods, ResNet50-SF is proposed for pre-extracting image features: because ResNet50 is limited in recognizing small objects and specific spatial positions, a bidirectional feature pyramid network (BiFPN) with spatial attention is introduced. To address the difficulty of aligning multi-modal features and the low efficiency of fusion, text block features are introduced as a bridge between image and text; these encompass block position, shape, sequence, and relative arrangement. Image features are divided into semantic and spatial aspects, while text block features are divided into positional and index attributes. Efficient multimodal fusion is then achieved with multi-layer Transformer Encoders. Experimental results on the MultiDoc-InfoExtract Dataset demonstrate the superior performance of this method on Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks, achieving F1, Precision, and Recall scores of 0.975, 0.975, and 0.975 on SER and 0.969, 0.953, and 0.986 on RE, with a single-image inference time of only 0.082 s. Ablation studies validate the contributions of the improved image feature extraction model, the inclusion of text block features, the feature refinement strategy, and the Transformer Encoder architecture to the performance of the question-answering system.
Additionally, comparative studies show that RSFE-TBT outperforms competing models in accuracy, speed, and model size.
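As a rough illustration of the fusion stage summarized above (a minimal sketch, not the authors' implementation), the snippet below concatenates hypothetical image tokens and text-block tokens into one sequence and passes it through stacked single-head self-attention encoder layers in NumPy. All dimensions, weight shapes, and the depth of the stack are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # assumed shared embedding width for both modalities


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def encoder_layer(tokens, wq, wk, wv, wo):
    """Single-head self-attention with a residual connection,
    the core operation of a Transformer Encoder layer."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(d))          # token-to-token weights
    return tokens + attn @ v @ wo                 # residual update


# Hypothetical pre-extracted features, already projected to width d:
# 4 image tokens (semantic + spatial) and 3 text-block tokens
# (positional + index attributes).
img_tokens = rng.standard_normal((4, d))
txt_tokens = rng.standard_normal((3, d))

# Early fusion: one shared sequence lets attention mix both modalities.
fused = np.concatenate([img_tokens, txt_tokens], axis=0)
weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
for _ in range(2):  # "multi-layer" stacking; depth 2 is an assumption
    fused = encoder_layer(fused, *weights)

print(fused.shape)  # (7, 16): one fused representation per input token
```

Because every token attends to every other token in the shared sequence, image regions and text blocks exchange information directly, which is the intuition behind Transformer-based multimodal fusion.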