DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition

Jiao, Jiayu; Tang, Yuming; Lin, Kun-Yu; Gao, Yipeng; Andy, J.; Wang, Yaowei; Zheng, Wei‐Shi

doi:10.1109/tmm.2023.3243616

Cited by 47 publications

(16 citation statements)

References 76 publications

“…The dilation value is always constrained to a minimum of 1, which is equivalent to the standard NA and has an upper bound of ⌊n/k⌋, where n is the number of tokens and k is the kernel or neighborhood size. DilateFormer [45] shows that distant patches in the shallow layers are mostly irrelevant in semantics modeling for mainstream vision tasks, so we set different dilation values in the shallow and deep layers. Specifically, we set the dilation values to 1, 1, 1, 1, 1, 2, 1, 2, 1, 3, 1, and 3 in the first two RDiNAGs and 1, 1, 1, 2, 1, 3, 1, 4, 1, 6, 1, and 8 in the last four RDiNAGs.…”

Section: Methods (mentioning)

confidence: 99%

Image super-resolution using dilated neighborhood attention transformer

Chen, Zuo, et al. 2024

J. Electron. Imag.

Transformer-based methods have achieved impressive performance in image super-resolution (SR). To reduce the computational cost and redundancy of global attention, most transformer-based methods adopt a localized attention mechanism, which diminishes the desirable characteristics of self-attention (SA), such as the effective modeling of long-range dependencies and the ability to capture a global receptive field. To alleviate this problem, we propose a dilated neighborhood attention transformer for image SR (DiNAT-SR) to improve SwinIR for image SR; in it, we replace SA with dilated neighborhood attention (DiNA) to capture more global data and allow the receptive field to grow exponentially. In addition, we introduce a convolutional modulation block into the transformer to enhance the visual representation and facilitate smoother convergence during training. Our research has, for the first time, confirmed the feasibility of DiNA in the field of image SR. Experimental results have demonstrated the effectiveness of DiNAT-SR, with better results than SwinIR on most benchmarks, both quantitatively and visually. We also provide a comparison of lightweight image SR models, and our model performs better than SwinIR-light on all benchmarks, with similar total numbers of parameters and floating-point operations. The effectiveness of each introduced component is also validated by an ablation study.
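The dilation bound quoted in the citation statement for this entry (a minimum of 1 and a maximum of ⌊n/k⌋) can be illustrated with a minimal sketch. It assumes a single one-dimensional token axis, and the helper names `max_dilation` and `dilated_neighbors` are hypothetical; they are not taken from DiNAT-SR or DilateFormer. The border handling (shifting the window inward so that it always holds k tokens) is a simplifying assumption as well.

```python
def max_dilation(n: int, k: int) -> int:
    """Upper bound on the dilation for n tokens and neighborhood size k: floor(n / k)."""
    if k <= 0 or n < k:
        raise ValueError("need 0 < k <= n")
    return n // k


def dilated_neighbors(i: int, n: int, k: int, d: int) -> list:
    """Indices of the k tokens that query i attends to with dilation d (1D case).

    Assumption: neighbors are taken from the tokens that share the query's
    residue modulo d, and the window is shifted inward at the borders.
    """
    if not 0 <= i < n:
        raise ValueError("query index out of range")
    if not 1 <= d <= max_dilation(n, k):
        raise ValueError("dilation must lie in [1, floor(n / k)]")
    # Every d-th token, starting from the query's residue class.
    group = list(range(i % d, n, d))
    pos = group.index(i)
    # Centre the window on the query, then clamp it to the valid range.
    start = min(max(pos - k // 2, 0), len(group) - k)
    return group[start:start + k]


if __name__ == "__main__":
    print(max_dilation(64, 7))
    print(dilated_neighbors(5, 12, 3, 1))
    print(dilated_neighbors(5, 12, 3, 2))
```

With d = 1 the function reduces to an ordinary sliding neighborhood, which matches the statement that a dilation of 1 is equivalent to standard NA. Because d never exceeds ⌊n/k⌋, each residue class always contains at least k tokens, so the window never runs out of neighbors.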

“…The global view of perception inevitably comes with a corresponding cost, i.e., the calculation of the affinity of pairs of tokens at all spatial locations brings unavoidable complex computations and burden computational resources. Thus, to alleviate this problem, many researchers have tried to reduce the attentional operations by restricting them within local windows, axial stripes, or dilated windows [32]-[34]. However, different semantic regions focus on significantly different key-value pairs, and forcing all queries to focus on the same set of tokens is suboptimal.…”

Section: Methods (mentioning)

confidence: 99%

An Improved Deep Neural Network for Small-Ship Detection in SAR Imagery

Hu, Miao 2024

IEEE J. Sel. Top. Appl. Earth Observations Remote Sensing

Ship detection in synthetic aperture radar (SAR) remote-sensing images plays an important role in managing water transportation and marine safety. However, complex backgrounds, small ship sizes, and low focus on small ships result in difficulties in feature extraction and low detection accuracy. This study proposes a new small SAR ship-detection network. First, a transformer-based dynamic sparse attention module is used to improve the focus and extraction of small ship features. Second, the feature maps are fused with deep layers, and small target-friendly detection heads are used to improve the processing of global information in the network. Third, a more suitable fused loss function is used for small ships to ensure the multi-scale detection capability. Experimental results on publicly available datasets, LS-SSDD v1.0 and AIR-SARShip-1.0, show that the proposed method effectively improves the detection accuracy of small ships on SAR images without increasing the computational burden. Compared with other methods based on the convolutional neural network, the proposed method demonstrates better multiscale detection performance.
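The cost argument in the citation statement for this entry (global attention scores every pair of tokens, while local or dilated windows score only a fixed number of keys per query) can be made concrete with some back-of-the-envelope counting. The sketch below is an illustration only: the function names are hypothetical, and it counts query-key affinity pairs for a single head while ignoring the channel dimension.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every token attends to every token."""
    n = height * width
    return n * n


def windowed_attention_pairs(height: int, width: int, window: int) -> int:
    """Query-key pairs when each token attends to a window x window set of keys.

    The count is the same for a dense local window and a dilated window,
    because dilation changes which keys are chosen, not how many.
    """
    n = height * width
    return n * window * window


def dilated_span(window: int, dilation: int) -> int:
    """Side length of the region covered by a dilated window."""
    return (window - 1) * dilation + 1


if __name__ == "__main__":
    print(global_attention_pairs(56, 56))
    print(windowed_attention_pairs(56, 56, 7))
    print(dilated_span(7, 3))
```

For a 56 × 56 feature map and a 7 × 7 window, the windowed variant scores 64 times fewer pairs than global attention, and a dilation of 3 stretches the same 49 keys over a 19 × 19 region at no extra cost in pairs.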

“…Multiscale defect detection has always been one of the research challenges in computer vision, facing problems such as uneven distribution of category samples, weak small-scale defect features, and easy overlap of defect areas. In the research of the general upstream model, Jiao et al. [42] proposed a DilateFormer network with expanded attention, reducing the redundancy of global modeling in Vision Transformer. Zhang et al. [43] proposed a novel lightweight and efficient attention module, which improves the residual blocks in the ResNet network and introduces the efficient pyramid split attention block into backbone network architecture, with stronger multi-scale representation capability, suitable for various computer vision tasks.…”

Section: Multi Scale Feature Fusion (mentioning)

confidence: 99%

FLCNet: faster and lighter cross-scale feature aggregation network for lead bar surface defect detection

Lv, Xia, et al. 2024

Meas. Sci. Technol.

To address defect inspection of lead bars under scale change, high reflection, and inclined deformation of defects, and to meet the need for faster detection, this paper proposes a faster and lighter cross-scale feature aggregation network (FLCNet). In this study, we focus on the redundancy of channel information and design a new partial channel group convolution, based on which we design a Faster C3 module and a lightweight cross-scale feature fusion module. In addition, we design a cross-scale slim neck to reduce the redundant feature transfer of the model. Finally, we propose a uniform brightness acquisition method for lead bar sidewall images by using a combined light source and construct a lead bar dataset with various complex defect samples. Experiments show that FLCNet effectively improves the detection accuracy of the surface defects of lead bars: the mAP@0.5 value reaches 97.1% and, compared with YOLOv5s, the model’s parameters are reduced by 33.9%. At the same time, the detection speed reaches 114.9 FPS, which is faster than other advanced detection models.
