2022
DOI: 10.48550/arxiv.2203.11926
Preprint

Focal Modulation Networks

Abstract: In this work, we propose focal modulation network (FocalNet in short), where self-attention (SA) is completely replaced by a focal modulation module that is more effective and efficient for modeling token interactions. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges at different granularity levels, (ii) gated aggregation to selectively aggregate context features for…
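
The first two components named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: it works on a 1-D token sequence instead of a 2-D feature map, uses fixed averaging kernels in place of learned depth-wise weights, uses ReLU as a stand-in activation, and takes the gate values as a plain input. All function names, kernel sizes, and gate values are assumptions made for illustration, and the third component is left out because the abstract is cut off before describing it.

```python
def depthwise_conv1d(x, kernels):
    """Depth-wise 1-D convolution with zero padding.

    x: list of tokens, each a list of C channel values.
    kernels: one odd-length kernel per channel (channels are never mixed).
    """
    n, c = len(x), len(x[0])
    out = [[0.0] * c for _ in range(n)]
    for ch in range(c):
        k = kernels[ch]
        half = len(k) // 2
        for i in range(n):
            acc = 0.0
            for j, w in enumerate(k):
                pos = i + j - half
                if 0 <= pos < n:  # positions outside the sequence count as zero
                    acc += w * x[pos][ch]
            out[i][ch] = acc
    return out


def relu(x):
    """Element-wise ReLU (assumed stand-in for the real activation)."""
    return [[max(0.0, v) for v in tok] for tok in x]


def hierarchical_contexts(x, kernel_sizes):
    """Stack depth-wise convolutions; each level consumes the previous
    level's output, so later levels summarize a longer range."""
    c = len(x[0])
    levels, z = [], x
    for size in kernel_sizes:
        # Averaging kernels are placeholders for learned weights (assumption).
        kernels = [[1.0 / size] * size for _ in range(c)]
        z = relu(depthwise_conv1d(z, kernels))
        levels.append(z)
    return levels


def gated_aggregation(levels, gates):
    """Sum the context levels per token, weighted by gates[i][l],
    the weight token i assigns to level l (hypothetical input)."""
    n, c = len(levels[0]), len(levels[0][0])
    out = [[0.0] * c for _ in range(n)]
    for l, z in enumerate(levels):
        for i in range(n):
            for ch in range(c):
                out[i][ch] += gates[i][l] * z[i][ch]
    return out


# Four tokens with two channels each (made-up values).
tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0]]
levels = hierarchical_contexts(tokens, kernel_sizes=[3, 5])

# Every token attends only to the short-range level here.
gates = [[1.0, 0.0] for _ in tokens]
context = gated_aggregation(levels, gates)
```

With the gates set to select only the first level, `context` equals `levels[0]`; shifting weight toward the second level mixes in context gathered over a longer range, which is the selection behavior the abstract attributes to gated aggregation.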

Cited by 10 publications (11 citation statements)
References 77 publications
“…For the visual backbone, we adopt pretrained Swin-T/L [34] by default. We also use Focal-T [48] in our ablation studies following [60]. For the language backbone, we adopt the pretrained base model in UniCL [49].…”
Section: Methods (mentioning)
confidence: 99%
“…Inspired by the success of vision transformers, researchers have challenged the traditional small kernel design of CNNs [22,52] and suggested the use of large convolution kernels for visual tasks [11,17,18,38,40,46,73]. For example, ConvNeXt [40] suggest directly adopting a 7×7 depth-wise convolution, while the Visual Attention Network (VAN) [18] uses a kernel size of 21 × 21 and introduces an attention mechanism.…”
Section: Large Kernel Design In CNNs (mentioning)
confidence: 99%
“…The inference pathway used to segment a new case follows the same feed forward path shown by the black arrows plus additional re-locating and segmentation steps downstream of the blue arrows. The proposed model is built via a cascade network, which is composed of three subnetworks, that is, a focal modulation, 21 a hierarchical block 22 and a topological 23 fully convolutional network (FCN). We name the proposed cascade network as a topological modulated network.…”
Section: Overview (mentioning)
confidence: 99%