2022 · Preprint
DOI: 10.48550/arxiv.2204.11436
SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images

Cited by 6 publications (16 citation statements) · References 41 publications (116 reference statements)
“…Fusion finds its utility in a myriad of applications, including High Dynamic Range (HDR) imaging, color transfer, and infrared-visible fusion [34]. The literature delineates three primary approaches to image fusion [35]: Low-Level Fusion (LLF) or early fusion [36,37], Mid-Level Fusion (MLF) [38,39], and High-Level Fusion (HLF) or late fusion [40]. These approaches are distinguished by the stage at which fusion occurs: prior to input, during feature extraction, or after feature extraction.…”
Section: Image Fusion
Mentioning confidence: 99%
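The three stages the quote distinguishes can be made concrete with a small sketch. The PyTorch snippet below is purely illustrative: the toy encoder/decoder modules and the max/sum/average fusion rules are assumptions chosen for demonstration, not taken from any of the cited papers.

# Illustrative sketch of early / mid / late fusion; the fusion rules
# (max, sum, average) and the toy layers are assumptions, not from [35-40].
import torch
import torch.nn as nn

encoder = nn.Conv2d(1, 16, 3, padding=1)   # toy per-modality feature extractor
decoder = nn.Conv2d(16, 1, 3, padding=1)   # toy reconstruction head

def early_fusion(ir, vis):
    # LLF / early fusion: combine the raw inputs before any feature extraction.
    return decoder(encoder(torch.max(ir, vis)))

def mid_fusion(ir, vis):
    # MLF: extract features per modality, then fuse in feature space.
    return decoder(encoder(ir) + encoder(vis))

def late_fusion(ir, vis):
    # HLF / late fusion: run the full pipeline per modality, then fuse outputs.
    return 0.5 * (decoder(encoder(ir)) + decoder(encoder(vis)))

ir, vis = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
for f in (early_fusion, mid_fusion, late_fusion):
    print(f.__name__, f(ir, vis).shape)   # all: torch.Size([1, 1, 64, 64])

The only thing that changes across the three functions is where the two modalities are merged, which is exactly the distinction the quoted taxonomy draws.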
“…Li et al. [25] combined a CNN with a transformer, extracting local features with the CNN and capturing long-range dependencies with the transformer. Moreover, Wang et al. built a pure transformer network to extract the long-range dependencies of images, and they designed an L1-norm-based strategy to measure and preserve infrared saliency and visible texture information [27]. Ma et al. [2] also proposed a pure transformer-based fusion model (SwinFusion), which utilizes cross-domain global learning to implement intra- and inter-domain fusion based on self-attention and cross-attention, and they introduced the Swin Transformer to extract the long-range dependencies of images.…”
Section: B. Transformer-Based Fusion Methods
Mentioning confidence: 99%
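As a rough illustration of the L1-norm-based strategy the quote attributes to [27], the sketch below weights two feature maps by their channel-wise L1-norm activity. The function name l1_fusion and the softmax weighting are assumptions for this sketch; the exact rule in the cited paper may differ.

# Minimal sketch of an L1-norm activity-level fusion rule (assumed form).
import torch

def l1_fusion(feat_ir, feat_vis):
    # Activity map per modality: channel-wise L1 norm, shape (B, 1, H, W).
    act_ir = feat_ir.abs().sum(dim=1, keepdim=True)
    act_vis = feat_vis.abs().sum(dim=1, keepdim=True)
    # Softmax over the two activity maps yields per-pixel fusion weights.
    w = torch.softmax(torch.cat([act_ir, act_vis], dim=1), dim=1)
    return w[:, 0:1] * feat_ir + w[:, 1:2] * feat_vis

feat_ir, feat_vis = torch.rand(1, 16, 32, 32), torch.rand(1, 16, 32, 32)
print(l1_fusion(feat_ir, feat_vis).shape)   # torch.Size([1, 16, 32, 32])

The intuition matches the quote: pixels where the infrared features have high L1 activity (salient targets) lean toward the infrared map, while high-activity visible regions (texture) lean toward the visible map.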
“…Li et al. [25] and Vibashan et al. [26] combined the transformer with CNNs to extract images' local features and long-range dependencies. In addition, Ma et al. [2] and Li et al. [27] introduced the Swin Transformer to infrared and visible image fusion tasks.…”
Section: Introduction
Mentioning confidence: 99%
“…To overcome this shortcoming, the transformer has been applied to IVF tasks. Since these transformer-based methods integrate the transformer with the CNN [29], [30], they can simultaneously extract both local features and long-range dependencies [31], [32].…”
Section: A. Vision-Perception Oriented IVF
Mentioning confidence: 99%
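A minimal sketch of such a hybrid extractor, assuming a single convolution for local features and one standard transformer encoder layer (with one token per pixel) for long-range dependencies. The class name HybridExtractor and all layer sizes are illustrative, not taken from the cited methods [29-32].

# Assumed hybrid CNN + transformer feature extractor (illustrative sizes).
import torch
import torch.nn as nn

class HybridExtractor(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.local = nn.Conv2d(1, channels, 3, padding=1)     # local features (CNN)
        self.global_ = nn.TransformerEncoderLayer(
            d_model=channels, nhead=4, batch_first=True)      # long-range dependencies

    def forward(self, x):
        f = self.local(x)                       # (B, C, H, W)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)   # (B, H*W, C): one token per pixel
        tokens = self.global_(tokens)           # self-attention over all positions
        return tokens.transpose(1, 2).reshape(b, c, h, w)

x = torch.rand(1, 1, 16, 16)
print(HybridExtractor()(x).shape)   # torch.Size([1, 32, 16, 16])

The convolution sees only a 3x3 neighborhood, while the self-attention layer lets every spatial position attend to every other one, which is the complementary pairing the quote describes.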
“…Specifically, we compare the proposed method with several high-level vision task-driven methods (i.e., PSFusion [36], SegMiF [37], and TarDAL [38]), which are driven by either semantic segmentation or object detection tasks. Finally, we compare the proposed method with several vision-perception-oriented methods (i.e., CBF [15], DDcGAN [26], MetaFusion [41], and SwinFuse [32]), which include traditional, CNN-based, meta-learning-based, and Transformer-based methods.…”
Section: A. Experimental Configurations
Mentioning confidence: 99%