2022
DOI: 10.48550/arxiv.2203.14415
Preprint

Mugs: A Multi-Granular Self-Supervised Learning Framework

Abstract: In self-supervised learning, multi-granular features are highly desirable but rarely investigated, since different downstream tasks (e.g., general and fine-grained classification) often require different or multi-granular features, e.g., fine-grained features, coarse-grained features, or a mixture of the two. In this work, for the first time, we propose an effective MUlti-Granular Self-supervised learning (Mugs) framework to explicitly learn multi-granular visual features. Mugs has three complementary granular supervisions: 1) an instance di…

Citations: cited by 3 publications (4 citation statements)
References: 37 publications
“…We compare various image encoders and their initialization methods in Table 9. (a) In our default setting, we report results with ViT-L initialized with MUGS (Zhou et al 2022b) pre-training on ImageNet-1k. (b) We get similar/slightly better results using iBOT (Zhou et al 2022a) pre-training.…”
Section: Out Of | mentioning
confidence: 99%
“…(c) Our results improve by 2-4% by using the iBOT pre-training on ImageNet-21k, which shows the advantage of our method to leverage the improvement in the image self-supervised methods. (d,e) show our results with ViT-B backbone initialized by MUGS (Zhou et al 2022b) and DINO (Caron et al 2021) pre-training on Imagenet-1k. Lastly, in (f), we show similar results to our default setting with computationally efficient SWIN-B backbone initialized with EsViT (Li et al 2022) pre-training.…”
Section: Out Of | mentioning
confidence: 99%
“…Instance discrimination. Instance discrimination generates multiple views of an image through random image augmentations and then pulls representations of multiple views together [37], [38], [39], [40]. Based on this framework, researchers have proposed different forms of loss functions, such as contrastive learning [11], [41], [42], [43], feature alignment [44], [45], [46], clustering assignment [47], [48], [49], redundancy reduction [50] and relational modeling [4], [51].…”
Section: Self-supervised Learning | mentioning
confidence: 99%
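
The instance discrimination idea quoted above (two random views per image, with the representations of matching views pulled together) can be illustrated with a small contrastive loss. This is a minimal sketch and not the Mugs implementation: it uses an InfoNCE-style loss as one example of the "contrastive learning" family the statement lists, treats each raw vector as its own representation (no encoder), and stands in for image augmentation with Gaussian noise. The names `augment`, `info_nce`, `noise`, and `temperature` are all hypothetical choices made for this illustration.

```python
import math
import random


def augment(x, rng, noise=0.05):
    """Hypothetical stand-in for a random image augmentation: add small Gaussian noise."""
    return [v + rng.gauss(0.0, noise) for v in x]


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(views_a, views_b, temperature=0.1):
    """Mean InfoNCE-style loss: views_a[i] should match views_b[i] and no other entry."""
    a = [normalize(v) for v in views_a]
    b = [normalize(v) for v in views_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Similarity of this anchor to every candidate view in the batch.
        logits = [dot(anchor, other) / temperature for other in b]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching view (index i) as the target.
        total += log_sum - logits[i]
    return total / len(a)


rng = random.Random(0)
# Toy "images": 8 random 16-dimensional vectors (an assumption for illustration).
images = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
view_a = [augment(x, rng) for x in images]
view_b = [augment(x, rng) for x in images]

aligned_loss = info_nce(view_a, view_b)
# Pair each view with a view of a different image; the loss should be much larger.
shuffled_loss = info_nce(view_a, view_b[1:] + view_b[:1])
print(aligned_loss, shuffled_loss)
```

Minimizing such a loss rewards representations in which two views of the same image are closer to each other than to views of other images; the other loss families named in the statement (feature alignment, clustering assignment, redundancy reduction, relational modeling) pursue the same goal with different objectives.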
“…Deep neural networks (DNNs) have made remarkable success in many fields, e.g. computer vision [7,8,15,16,17] and natural language processing [18,19]. A noticeable part of such success is contributed by the stochastic gradient based optimizers which find satisfactory solutions with high efficiency.…”
Section: Introduction | mentioning
confidence: 99%