2021
DOI: 10.48550/arxiv.2111.09734
Preprint
ClipCap: CLIP Prefix for Image Captioning

Abstract: Image captioning is a fundamental task in vision-language understanding, where the model predicts an informative textual caption for a given input image. In this paper, we present a simple approach to address this task. We use the CLIP encoding as a prefix to the caption, by employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features which were trained with textual context, making it well suited for vision-language …
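The abstract sketches the pipeline: a CLIP image embedding is passed through a small mapping network to produce a "prefix" in the language model's input space, and the language model is then fine-tuned to continue that prefix into a caption. Below is a minimal sketch of such a mapper in PyTorch; the MLP shape, the 512/768 dimensions, and the prefix length of 10 are illustrative assumptions, not necessarily the paper's exact configuration.

```python
# Hypothetical sketch: map one CLIP image embedding to a sequence of
# prefix embeddings in the language model's input space.
import torch
import torch.nn as nn

class ClipPrefixMapper(nn.Module):
    def __init__(self, clip_dim: int = 512, lm_embed_dim: int = 768,
                 prefix_length: int = 10):
        super().__init__()
        self.prefix_length = prefix_length
        self.lm_embed_dim = lm_embed_dim
        hidden = (lm_embed_dim * prefix_length) // 2
        # A simple MLP that expands one CLIP vector into
        # prefix_length language-model embeddings.
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, lm_embed_dim * prefix_length),
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_length, lm_embed_dim)
        batch = clip_embedding.shape[0]
        return self.mlp(clip_embedding).view(
            batch, self.prefix_length, self.lm_embed_dim)
```

During training, these prefix embeddings would be concatenated in front of the caption's token embeddings and the combined sequence fed to the language model under a standard captioning loss, while the CLIP encoder itself can stay frozen.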

Cited by 85 publications (158 citation statements) · References 35 publications
“…However, this ability is exactly the generative task DALL-E was trained to do, only in new domains. No previous computer vision work, as far as we can ascertain, has … [flattened table residue; recoverable caption: Table 2, “Comparison of our method and the CLIP-Prefix [49] baseline on our novel benchmark for visual relations,” with B@1, R@5, and C-s columns]…”
Section: Discussion and Limitations · mentioning
confidence: 99%
“…1, we present our results on COCO's test set [42]. We compare against two recent baselines that use CLIP's embedding: CLIP-Prefix [49] and CLIP-VL [61]. In CLIP-Prefix, the image is encoded with CLIP and the resulting representation is plugged in as a token to a fine-tuned GPT-2.…”
Section: Image Captioning · mentioning
confidence: 99%
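This statement describes the decoding side: the CLIP-derived representation enters a fine-tuned GPT-2 as if it were a sequence of input tokens. Below is a hedged sketch of such prefix-conditioned decoding with Hugging Face transformers, taking the mapper output from the earlier sketch as input; the model name and the greedy loop are illustrative, and the cited works may decode differently (e.g. with beam search).

```python
# Hypothetical sketch: generate a caption by feeding prefix embeddings
# to GPT-2 via inputs_embeds and decoding greedily.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def caption_from_prefix(prefix_embeds: torch.Tensor,
                        max_tokens: int = 20) -> str:
    # prefix_embeds: (1, prefix_length, 768), e.g. a ClipPrefixMapper output.
    generated = prefix_embeds
    token_ids = []
    with torch.no_grad():
        for _ in range(max_tokens):
            # Predict the next token from the running embedding sequence.
            logits = lm(inputs_embeds=generated).logits
            next_id = int(logits[0, -1].argmax())
            if next_id == tokenizer.eos_token_id:
                break
            token_ids.append(next_id)
            # Append the new token's embedding and continue.
            next_embed = lm.transformer.wte(torch.tensor([[next_id]]))
            generated = torch.cat([generated, next_embed], dim=1)
    return tokenizer.decode(token_ids)
```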