Multimodal image fusion is an active area of research with applications across computer vision. This research proposes a modification to convolutional layers that fuses two different image modalities. We introduce a novel architecture with adaptive fusion mechanisms that learn the optimal weighting of each modality at every convolutional layer. The proposed method is evaluated on a publicly available dataset, and the experimental results show that it outperforms state-of-the-art methods across multiple evaluation metrics.
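To make the idea concrete, below is a minimal sketch of what a per-layer adaptive fusion block could look like in PyTorch. This is not the paper's implementation: the module name `AdaptiveFusionConv`, the choice of two modalities, and the gating scheme (a softmax over learnable per-modality scores) are illustrative assumptions.

```python
# Sketch only: one plausible realization of learnable per-layer modality
# weighting, not the authors' architecture.
import torch
import torch.nn as nn


class AdaptiveFusionConv(nn.Module):
    """Fuses two modality feature maps with learnable weights, then convolves."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        # One learnable score per modality; a softmax turns them into
        # fusion weights that sum to 1 and are trained with the network,
        # so each layer can learn its own modality balance.
        self.modality_scores = nn.Parameter(torch.zeros(2))
        self.conv = nn.Conv2d(
            in_channels, out_channels, kernel_size, padding=kernel_size // 2
        )

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.modality_scores, dim=0)
        fused = w[0] * x_a + w[1] * x_b  # weighted sum of the two modalities
        return self.conv(fused)


# Usage: fuse, e.g., RGB and thermal feature maps of matching shape.
layer = AdaptiveFusionConv(in_channels=64, out_channels=128)
rgb_feat = torch.randn(1, 64, 32, 32)
thermal_feat = torch.randn(1, 64, 32, 32)
out = layer(rgb_feat, thermal_feat)  # -> shape (1, 128, 32, 32)
```

Because the scores are ordinary parameters, the fusion weights are learned end-to-end by backpropagation, and stacking such blocks gives each convolutional layer its own modality weighting, as the abstract describes.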