Single-frame infrared small target detection remains a challenging task due to complex backgrounds and the weak structural characteristics of small targets. Recently, convolutional neural networks (CNNs) have been introduced to infrared small target detection and are widely used owing to their excellent performance. However, existing CNN-based methods mainly focus on local spatial features while ignoring the long-range contextual dependencies between small targets and backgrounds. To capture global context-aware information, we propose a fusion network architecture of Transformer and CNN (FTC-Net), which consists of two branches. The CNN-based branch uses a U-Net with skip connections to obtain low-level local details of small targets. The Transformer-based branch applies hierarchical self-attention mechanisms to learn long-range contextual dependencies; it can thereby suppress background interference and enhance target features. To obtain both local and global feature representations, we design a feature fusion module (FFM) that fuses the features of the two branches. We conduct ablation and comparative experiments on the publicly available SIRST dataset. Experimental results demonstrate the effectiveness of the Transformer-based branch and the superiority of the proposed FTC-Net over other state-of-the-art methods.
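The abstract does not specify the internals of the FFM. As a rough illustration only, assuming the module concatenates the feature maps of the two branches along the channel dimension and mixes them with a 1×1 convolution (a common fusion pattern, not necessarily the authors' exact design), a minimal NumPy sketch might look like:

```python
import numpy as np

def feature_fusion(cnn_feat, trans_feat, weight):
    """Hypothetical feature fusion module (FFM) sketch.

    Concatenates the CNN-branch and Transformer-branch feature maps
    along the channel axis, then mixes channels with a 1x1 convolution,
    which is equivalent to a linear map over channels at every pixel.

    cnn_feat:   (C1, H, W) local features from the U-Net branch
    trans_feat: (C2, H, W) global features from the Transformer branch
    weight:     (C_out, C1 + C2) 1x1-convolution kernel
    """
    fused = np.concatenate([cnn_feat, trans_feat], axis=0)  # (C1+C2, H, W)
    c, h, w = fused.shape
    out = weight @ fused.reshape(c, h * w)  # channel mixing per pixel
    return out.reshape(weight.shape[0], h, w)

rng = np.random.default_rng(0)
local_feat = rng.standard_normal((16, 32, 32))   # CNN branch: local details
global_feat = rng.standard_normal((16, 32, 32))  # Transformer branch: context
w = rng.standard_normal((32, 32))                # hypothetical 1x1-conv weights
fused = feature_fusion(local_feat, global_feat, w)
print(fused.shape)  # (32, 32, 32)
```

The sketch shows only the shape bookkeeping of channel-wise fusion; the actual FFM may additionally use attention weighting, normalization, or learned gating between the two branches.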