Florence: A New Foundation Model for Computer Vision

Liu, Yuan; Chen, Dongdong; Codella, Noel; Dai, Xiyang; Gao, Jianfeng; Hu, Houdong; Huang, Xuedong; Li, Boxin; Li, Chunyuan; Liu, Ce; Liu, Mengchen; Liu, Zicheng; Lu, Yumao; Wang, Limin; Wang, Jianfeng; Xiao, Bin; Xiao, Zhen; Yang, Jianwei; Zeng, Michael; Zhou, Luowei; Zhang, Pengchuan

doi:10.48550/arxiv.2111.11432

Cited by 112 publications

(182 citation statements)

References 56 publications

Supporting

Mentioning

142

Contrasting

Order By: Relevance

“…Our results with just K400 (86.7%) is already similar to recent 86.5% Florence [95] and 86.8% SwinV2-G [58]. Florence uses 900M curated text-image pairs.…”

Section: Main Results On Kineticssupporting

confidence: 86%

See 1 more Smart Citation

Masked Feature Prediction for Self-Supervised Visual Pre-Training

Chen¹,

Fan²,

Xie³

et al. 2021

Preprint

103

View full text Add to dashboard Cite

We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pretrained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 38.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame and obtains competitive results on ImageNet.

show abstract

“…Our results with just K400 (86.7%) is already similar to recent 86.5% Florence [95] and 86.8% SwinV2-G [58]. Florence uses 900M curated text-image pairs.…”

Section: Main Results On Kineticssupporting

confidence: 86%

“…Note that these models are strong baselines and are state-of-the-art for training-from-scratch on their own. Still, 300 epochs of MaskFeat pre-training improve the scratch MViT-S, 16×4 [56] Sup., JFT-300M 84.9 95.8 3981×3×4 TokenLearner [75] Sup., JFT-300M 85.4 N/A 4076×3×4 Florence↑384 [95] Text, FLD-900M 86. 5 3.…”

Section: Main Results On Kineticsmentioning

confidence: 99%

Masked Feature Prediction for Self-Supervised Visual Pre-Training

Chen¹,

Fan²,

Xie³

et al. 2021

Preprint

103

View full text Add to dashboard Cite

show abstract

“…2. In the second part of the tables, we compare to methods that are pretrained on web-scale datasets such as Instagram 65M [25], JFT-300M [62], JFT-3B [81], WTS [61], Florence [80] or HowTo100M [47]. Observe that we achieve state-of-the-art results both with and without web-scale pretraining.…”

Section: Comparison To the State-of-the-artmentioning

confidence: 98%

Multiview Transformers for Video Recognition

Shen¹,

Xiong²,

Arnab³

et al. 2022

Preprint

View full text Add to dashboard Cite

Video understanding requires reasoning at multiple spatiotemporal resolutions -from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the stateof-the-art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders to represent different views of the input video with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining. We will release code and pretrained checkpoints.

show abstract

“…In contrast, our benchmark focuses on task-level transfer across domains, i.e., it aims to evaluate the transferability of models, by pre-training from their own large corpus, then evaluating zero-shot performance on a diverse set of downstream datasets. This setting has been recently studied [32,51,33,72], and is arguably more practical for real-world applications, as it brings the convenience towards the spirit of one-model-for-all. The well-known ImageNet-1K dataset [9] was originally proposed as a large dataset for model training and testing.…”

Section: Visual Recognition Benchmarks: Zero-shot and Transfer Learningmentioning

confidence: 99%

“…The success has quickly inspired many follow-up large-scale pre-training works [68,72,69,36,43,20,34,74]. Each of them developed their own evaluation experiments, covering a customized subset of tasks, and leaving the details of model adaptation process less accessible.…”

Section: Introductionmentioning

confidence: 99%

ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

Li¹,

Liu²,

Li³

et al. 2022

Preprint

Self Cite

View full text Add to dashboard Cite

Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets/tasks. However, it remains a challenge to evaluate the transferablity of these foundation models due to the lack of easy-to-use toolkits for fair benchmarking. To tackle this, we build ELEVATER 1 , the first benchmark to compare and evaluate pre-trained language-augmented visual models. Several highlights include: (i) Datasets. As downstream evaluation suites, it consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge. (ii) Toolkit. An automatic hyper-parameter tuning toolkit is developed to ensure the fairness in model adaption. To leverage the full power of language-augmented visual models, novel language-aware initialization methods are proposed to significantly improve the adaption performance. (iii) Metrics. A variety of evaluation metrics are used, including sample-efficiency (zero-shot and few-shot) and parameter-efficiency (linear probing and full model fine-tuning). We will release our toolkit and evaluation platforms for the research community.

show abstract

Florence: A New Foundation Model for Computer Vision

Cited by 112 publications

References 56 publications

Masked Feature Prediction for Self-Supervised Visual Pre-Training

Masked Feature Prediction for Self-Supervised Visual Pre-Training

Multiview Transformers for Video Recognition

ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

Contact Info

Product

Resources

About