Yoad Tewel scite author profile

Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption for a given image. In this work, we repurpose such models to generate a descriptive text given an image at inference time, without any further training or tuning step. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible, and we demonstrate its ability to perform image arithmetic in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests. Our code is available at https://github.com/YoadTew/zero-shot-image-to-text.

Zero-Shot Video Captioning with Evolving Pseudo-Tokens

Tewel, Shalev, Nadler, et al. 2022. Preprint.

Key-Locked Rank One Editing for Text-to-Image Personalization

Tewel, Gal, Chechik, et al. 2023.

What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs

Shaharabany, Tewel, Wolf. 2022. Preprint.

Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects. This is achieved within an open world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism. Moreover, training takes place in a weakly supervised setting, where no bounding boxes are provided. To achieve this, our method combines two pre-trained networks: the CLIP image-to-text matching score and the BLIP image captioning tool. Training takes place on COCO images and their captions and is based on CLIP. Then, during inference, BLIP is used to generate a hypothesis regarding various regions of the current image. Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains. It also shows very convincing results in the novel task of weakly-supervised open-world purely visual phrase-grounding presented in our work. For example, on the datasets used for benchmarking phrase grounding, our method results in a very modest degradation in comparison to methods that employ human captions as an additional input. Our code is available at https://github.com/talshaharabany/what-is-where-by-looking and a live demo can be found at https://replicate.com/talshaharabany/what-is-where-by-looking.

Yoad Tewel

ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
