“…Recently, there has been a tendency to build performance-efficient deep neural networks for various image fusion tasks, owing to their strong nonlinear learning abilities. Learning-based fusion architectures, such as the autoencoder (AE) [13, 14, 16, 19], the convolutional neural network (CNN) [15, 18, 20], and the generative adversarial network (GAN) [21, 22, 24, 27, 29], have achieved notable improvements in fusion performance, but their single-scale frameworks can hardly capture the full-scale features of real-world targets and fail to make the fused images photorealistic. More importantly, most methods directly capitalize on the features extracted in the last layer to reconstruct the fused images, while the features from earlier layers are left unexploited.…”