Traditional remote sensing spatio-temporal data fusion algorithms generally fuse up-sampled low-resolution images (e.g., MODIS) with high-resolution images (e.g., Landsat). The up-sampled low-resolution images are poorly consistent spatially with the high-resolution images and contain many mixed pixels, so uncertainty errors propagate into the fusion results. To address this issue, we propose a framework that combines deep learning-based super-resolution techniques with traditional spatio-temporal fusion methods. By reconstructing the low-resolution images with super-resolution image reconstruction, we obtain low-resolution images that have more spatial detail and better spatial consistency with the high-resolution images; these reconstructed images are then fused using a spatio-temporal fusion method. In this study, we selected Flexible Spatio-Temporal Data Fusion (FSDAF) and the Residual Channel Attention Network (RCAN) for a detailed study to demonstrate the effectiveness of this framework; that is, we developed a new RCAN-FSDAF model. Testing shows that RCAN-FSDAF has the following advantages: (1) The band reflectance predicted by RCAN-FSDAF is closer to the base reflectance than that of FSDAF, DMNet, and GAN-STFM, as shown by higher correlation and smaller error. (2) RCAN-FSDAF better decomposes mixed pixels among heterogeneous features and more accurately identifies boundaries between different land cover features and changes in land cover type. (3) High spatial and temporal resolution NDVI data retrieved from the RCAN-FSDAF predictions are more accurate. The framework developed in this study can be extended to other spatio-temporal data fusion applications.
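The two-stage pipeline described above can be sketched as follows. This is a minimal illustration only: `super_resolve` stands in for the RCAN network (here a trivial pixel-replication upsample) and `fuse` stands in for the full FSDAF algorithm (here a simple temporal-change transfer); both function names and the toy image sizes are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def super_resolve(coarse: np.ndarray, scale: int) -> np.ndarray:
    """Stand-in for the RCAN super-resolution step.

    In the proposed framework a trained RCAN model would reconstruct
    spatial detail; here we simply replicate each coarse pixel so the
    pipeline shape logic is runnable.
    """
    return np.kron(coarse, np.ones((scale, scale)))

def fuse(fine_t1: np.ndarray, sr_t1: np.ndarray, sr_t2: np.ndarray) -> np.ndarray:
    """Stand-in for the FSDAF fusion step.

    Transfers the temporal change observed between the two
    super-resolved coarse images onto the fine-resolution base image.
    Real FSDAF additionally handles unmixing, residual distribution,
    and land-cover change.
    """
    return fine_t1 + (sr_t2 - sr_t1)

# Toy data: a 4x4 MODIS-like image and a 64x64 Landsat-like base image
# (scale factor 16 is illustrative, not the MODIS/Landsat ratio).
scale = 16
rng = np.random.default_rng(0)
coarse_t1 = rng.random((4, 4))          # coarse image at the base date t1
coarse_t2 = coarse_t1 + 0.1             # coarse image at the prediction date t2
fine_t1 = np.kron(coarse_t1, np.ones((scale, scale)))  # fine image at t1

# Stage 1: super-resolve both coarse images; Stage 2: fuse with the fine base.
pred_t2 = fuse(fine_t1,
               super_resolve(coarse_t1, scale),
               super_resolve(coarse_t2, scale))
print(pred_t2.shape)  # (64, 64) — a fine-resolution prediction for t2
```

The key design point of the framework is that super-resolution is applied *before* fusion, so the fusion step operates on two images with consistent spatial detail rather than on a blurry upsampled coarse image.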