This focus makes our work closely related to recent image-text models such as CLIP [64] and ALIGN [37], both of which have shown zero-shot transfer ability for image classification. Like CLIP and ALIGN, our work learns a mapping between images and text, which relates it to many previous works, such as [2,3,8,13,22,24,32,33,36,42,51,52,54,56,57,60,71,72,73,91].