2021
DOI: 10.48550/arxiv.2111.15340
Preprint

MC-SSL0.0: Towards Multi-Concept Self-Supervised Learning

Sara Atito,
Muhammad Awais,
Ammarah Farooq
et al.

Abstract: Self-supervised pretraining is the method of choice for natural language processing models and is rapidly gaining popularity in many vision tasks. Recently, self-supervised pretraining has been shown to outperform supervised pretraining for many downstream vision applications, marking a milestone in the area. This superiority is attributed to the negative impact of incomplete labelling of the training images, which convey multiple concepts but are annotated using a single dominant class label. Although Self-Superv…

Cited by 3 publications (4 citation statements)
References 37 publications
“…Both SIMMIM and MAE used the ViT-B [13] model for pretraining on ImageNet-1K and fine-tuned on ImageNet-1K using classification labels. SIMMIM achieved 83.8% by pretraining for 800 epochs, while MAE obtained a marginally lower performance of 83.6% while requiring twice as many epochs. Another difference is the so-called decoder for transformers.…”
Section: Comparison With Post Art (mentioning)
confidence: 97%
“…Two notable extensions of GMML are MC-SSL [3] and iBOT [4]. Both are generalisations of the notion of GMML to non-autoencoder based learning tasks and achieved remarkable performance.…”
Section: Comparison With Post Art (mentioning)
confidence: 99%
“…Transformers [29] have shown great success in various Natural Language Processing (NLP) and Computer Vision (CV) tasks [30][31][32][33][34][35][36][37] and are the basis of our proposed framework. Vision transformer [38] … The transformer encoder consists of L consecutive Multi-head Self-Attention (MSA) and Multi-Layer Perceptron (MLP) blocks.…”
Section: Vision Transformer (mentioning)
confidence: 99%
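The quoted description of the ViT encoder (L consecutive MSA and MLP blocks) can be illustrated with a minimal PyTorch sketch. This is not the cited authors' implementation; the pre-norm layout, dimensions, and class names are illustrative assumptions.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # One ViT-style encoder block: Multi-head Self-Attention followed by an
    # MLP, each with a residual connection and LayerNorm (pre-norm layout).
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                       # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]           # residual after MSA
        x = x + self.mlp(self.norm2(x))         # residual after MLP
        return x

# A depth-L encoder is simply L such blocks applied in sequence
# (here L = 12, as in ViT-B; 197 tokens = 196 image patches + one class token).
encoder = nn.Sequential(*[EncoderBlock() for _ in range(12)])
tokens = torch.randn(2, 197, 768)
out = encoder(tokens)                           # shape: (2, 197, 768)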
“…In this paper, we follow MAE [19] to adopt the most simple and intuitive raw pixels regression. In terms of masking strategies, SiT [2], MC-SSL0.0 [1] and BeiT [3] use a block-wise masking strategy, where a block of neighbouring tokens arranged spatially are masked. MAE [19] and SimMIM [37] use random masking with a large masked patch size or a large proportion of masked patches.…”
Section: Related Work (mentioning)
confidence: 99%
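The two masking strategies contrasted in this quote, block-wise masking of spatially neighbouring patches versus random masking of individual patches, can be sketched as follows. The grid size, block size, and masking ratio are illustrative assumptions, not values taken from any of the cited papers.

import numpy as np

def random_mask(grid=14, ratio=0.75, seed=0):
    # MAE/SimMIM-style: mask a large random fraction of individual patch tokens.
    rng = np.random.default_rng(seed)
    n = grid * grid
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(n * ratio), replace=False)] = True
    return mask.reshape(grid, grid)

def blockwise_mask(grid=14, block=4, num_blocks=6, seed=0):
    # SiT/MC-SSL0.0/BEiT-style: mask rectangular blocks of spatially
    # neighbouring patch tokens.
    rng = np.random.default_rng(seed)
    mask = np.zeros((grid, grid), dtype=bool)
    for _ in range(num_blocks):
        r = rng.integers(0, grid - block + 1)
        c = rng.integers(0, grid - block + 1)
        mask[r:r + block, c:c + block] = True
    return mask

print(random_mask().sum(), "of 196 patches masked at random")
print(blockwise_mask().sum(), "of 196 patches masked in spatial blocks")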