2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01739
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Cited by 76 publications (44 citation statements)
References 27 publications
“…As expected, these achieve a better score than CapDec, as they exploit the additional supervision of image-text pairs. Nevertheless, compared to the unsupervised approaches of MAGIC (Su et al, 2022) and Zero-Cap (Tewel et al, 2022), CapDec achieves superior scores. Note that ZeroCap does not require any training data, while MAGIC requires text data similar to our setting.…”
Section: Results
confidence: 96%
“…CLIP (2021) marked a turning point in vision-language perception, and has been utilized for vision-related tasks by various distillation techniques (Song et al, 2022; Jin et al, 2021; Gal et al, 2021; Khandelwal et al, 2022). Recent captioning methods use CLIP for reducing training time (Mokady et al, 2021), improved captions (Shen et al, 2021; Luo et al, 2022a,b; Cornia et al, 2021; Kuo and Kira, 2022), and in zero-shot settings (Su et al, 2022; Tewel et al, 2022). However, zero-shot techniques often result in inferior performance, as the produced captions are not compatible with the desired target style, which is usually dictated by a dataset.…”
Section: Related Work
confidence: 99%
“…A game engine can be modified to produce both graphical and textual output, which can then be used for bug detection. However, during a preliminary study, we tested CLIP-Cap (Mokady, Hertz, and Bermano 2021), ZeroCap (Tewel et al 2022) and OFA (Wang et al 2022) to create descriptions of videos, and found that none of them can describe frames from video games properly. Future studies should investigate how the description of event sequences can be automated.…”
Section: Future Research Directions
confidence: 99%
“…In AI, the most widely known and arguably best-performing models (such as GPT-3, Gato, AlphaGo, DALL-E, etc.) are those which use some form of backpropagation or reinforcement learning as the driving learning algorithm (Silver et al, 2016; Gundersen and Kjensmo, 2018; Brown et al, 2020; Zhang and Lu, 2021; Reed et al, 2022; Tewel et al, 2022). Backpropagation is a type of consequence feedback, since an error signal derived from the difference between the current behavior and the desired behavior is propagated mathematically to each node in the network.…”
Section: Introduction
confidence: 99%
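The "consequence feedback" idea in the excerpt above — an error between current and desired behavior propagated back to every node — can be illustrated with a minimal sketch. This is a hypothetical toy example (a two-weight sigmoid network with a made-up learning rate), not code from any cited work:

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def train_step(w1, w2, x, target, lr=0.5):
    """One backpropagation step on a tiny 2-layer, 1-unit-per-layer network."""
    # Forward pass: compute the network's current behavior.
    h = sigmoid(w1 * x)   # hidden activation
    y = sigmoid(w2 * h)   # network output
    # Error signal: current behavior minus desired behavior.
    err = y - target
    # Backward pass: the chain rule distributes that error to each weight.
    dy = err * y * (1.0 - y)        # gradient at the output node
    dw2 = dy * h
    dh = dy * w2                    # error propagated back to the hidden node
    dw1 = dh * h * (1.0 - h) * x
    # Gradient-descent update driven by the propagated error.
    return w1 - lr * dw1, w2 - lr * dw2, err

# Repeatedly nudge the output toward the desired value 1.0 for input 1.0.
w1, w2 = 0.5, 0.5
for _ in range(200):
    w1, w2, err = train_step(w1, w2, 1.0, 1.0)
```

Each iteration shrinks the error, showing how the propagated signal reshapes every weight rather than only the output layer.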