2023
DOI: 10.1109/tmm.2023.3243616

DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition

Cited by 47 publications (16 citation statements)
References 76 publications
“…The dilation value is always constrained to a minimum of 1, which is equivalent to the standard NA, and has an upper bound of n/k, where n is the number of tokens and k is the kernel or neighborhood size. DilateFormer [45] shows that distant patches in the shallow layers are mostly irrelevant for semantic modeling in mainstream vision tasks, so we set different dilation values in the shallow and deep layers. Specifically, we set the dilation values to 1, 1, 1, 1, 1, 2, 1, 2, 1, 3, 1, and 3 in the first two RDiNAGs and 1, 1, 1, 2, 1, 3, 1, 4, 1, 6, 1, and 8 in the last four RDiNAGs.…”
Section: Methods
confidence: 99%
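The dilation schedule in this excerpt is easier to see in code. Below is a minimal sketch, assuming a PyTorch-style single-head implementation, of the sliding-window dilated attention that DilateFormer and DiNAT-style models build on; the function name and the unfold-based neighbor gathering are illustrative assumptions, not the cited code.

import torch
import torch.nn.functional as F

def sliding_window_dilated_attention(q, k, v, kernel=3, dilation=2):
    """q, k, v: (B, C, H, W). Each query attends only to its own
    kernel x kernel neighborhood, sampled at the given dilation."""
    B, C, H, W = q.shape
    # The quoted bound: dilation stays >= 1 (dilation 1 recovers standard NA)
    # and small enough that the dilated window fits in the token grid (~ n/k).
    assert 1 <= dilation <= min(H, W) // kernel, "dilation outside the 1..n/k range"
    pad = dilation * (kernel - 1) // 2
    # Gather every position's kernel*kernel dilated neighbors: (B, C, k*k, H*W).
    k_nbr = F.unfold(k, kernel, dilation=dilation, padding=pad).view(B, C, kernel * kernel, H * W)
    v_nbr = F.unfold(v, kernel, dilation=dilation, padding=pad).view(B, C, kernel * kernel, H * W)
    attn = (q.view(B, C, 1, H * W) * k_nbr).sum(1, keepdim=True) / C ** 0.5
    attn = attn.softmax(dim=2)          # normalize over the k*k neighbors
    out = (attn * v_nbr).sum(dim=2)     # weighted sum of neighbor values
    return out.view(B, C, H, W)

With dilation 1 the window is the ordinary k x k neighborhood; larger dilations widen the receptive field at the same cost, which is why the deeper RDiNAGs in the quoted schedule use values up to 8.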
“…A global view of perception inevitably comes at a corresponding cost: computing the affinity of every pair of tokens across all spatial locations incurs heavy computation and burdens computational resources. To alleviate this problem, many researchers have tried to reduce the attention operations by restricting them to local windows, axial stripes, or dilated windows [32]-[34]. However, different semantic regions attend to significantly different key-value pairs, and forcing all queries to focus on the same set of tokens is suboptimal.…”
Section: Methods
confidence: 99%
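To make the saving from window restriction concrete, here is a back-of-the-envelope sketch (the formulas are assumptions for illustration, not taken from the cited papers) counting the query-key pairs scored by global attention versus a local-window variant:

def attention_pairs(h, w, window=7):
    """Count query-key pairs scored over an h x w token grid."""
    n = h * w
    global_pairs = n * n            # every token attends to every token
    local_pairs = n * window ** 2   # every token attends to a window x window region
    return global_pairs, local_pairs

g, l = attention_pairs(56, 56)      # a typical early-stage feature-map size
print(f"global: {g:,} pairs; local 7x7: {l:,} pairs ({g // l}x fewer)")

On a 56 x 56 grid this is a 64x reduction, which is the pressure behind the local-window, axial-stripe, and dilated-window designs the excerpt lists.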
“…Multi-scale defect detection has long been one of the research challenges in computer vision, facing problems such as the uneven distribution of category samples, weak features of small-scale defects, and frequent overlap of defect regions. Among general upstream models, Jiao et al [42] proposed DilateFormer, a network with dilated attention that reduces the redundancy of global modeling in Vision Transformers. Zhang et al [43] proposed a novel lightweight and efficient attention module that improves the residual blocks of ResNet and introduces an efficient pyramid split attention block into the backbone architecture, yielding stronger multi-scale representation capability suitable for various computer vision tasks.…”
Section: Multi-Scale Feature Fusion
confidence: 99%
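As a rough illustration of the multi-scale idea in this excerpt, the sliding_window_dilated_attention sketch above can be applied to channel groups with different dilation rates; the group count and rates below are arbitrary assumptions, not the configuration of either cited work.

import torch

x = torch.randn(1, 64, 32, 32)                    # toy feature map
groups = torch.chunk(x, 4, dim=1)                 # split channels into 4 groups
outs = [sliding_window_dilated_attention(g, g, g, kernel=3, dilation=d)
        for g, d in zip(groups, (1, 2, 3, 4))]    # one dilation rate per group
y = torch.cat(outs, dim=1)                        # concatenated multi-scale output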