2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01739
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Cited by 76 publications (44 citation statements)
References 27 publications
“…As expected, these achieve a better score than CapDec, as they exploit the additional supervision of image-text pairs. Nevertheless, compared to the unsupervised approaches of MAGIC (Su et al, 2022) and Zero-Cap (Tewel et al, 2022), CapDec achieves superior scores. Note that ZeroCap does not require any training data, while MAGIC requires text data similar to our setting.…”
Section: Results
confidence: 96%
“…CLIP (2021) marked a turning point in vision-language perception, and has been utilized for vision-related tasks by various distillation techniques (Song et al, 2022; Jin et al, 2021; Gal et al, 2021; Khandelwal et al, 2022). Recent captioning methods use CLIP for reducing training time (Mokady et al, 2021), improved captions (Shen et al, 2021; Luo et al, 2022a,b; Cornia et al, 2021; Kuo and Kira, 2022), and in zero-shot settings (Su et al, 2022; Tewel et al, 2022). However, zero-shot techniques often result in inferior performance, as the produced captions are not compatible with the desired target style, which is usually dictated by a dataset.…”
Section: Related Work
confidence: 99%
“…A game engine can be modified to produce both graphical and textual output, which can then be used for bug detection. However, during a preliminary study, we tested CLIP-Cap (Mokady, Hertz, and Bermano 2021), ZeroCap (Tewel et al 2022) and OFA (Wang et al 2022) to create descriptions of videos, and found that none of them can describe frames from video games properly. Future studies should investigate how the description of event sequences can be automated.…”
Section: Future Research Directions
confidence: 99%
“…In AI, the most widely known and arguably best-performing models (such as GPT-3, Gato, AlphaGo, DALL-E, etc.) are those which use some form of backpropagation or reinforcement learning as the driving learning algorithm (Silver et al, 2016; Gundersen and Kjensmo, 2018; Brown et al, 2020; Zhang and Lu, 2021; Reed et al, 2022; Tewel et al, 2022). Backpropagation is a type of consequence feedback, since an error signal derived from the difference between the current behavior and the desired behavior is propagated mathematically to each node in the network.…”
Section: Introduction
confidence: 99%
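The "consequence feedback" idea in the excerpt above — an error between current and desired behavior propagated back to every node — can be illustrated with a minimal sketch. This is a hypothetical toy example (a two-weight sigmoid network with a made-up learning rate), not code from any cited work:

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def train_step(w1, w2, x, target, lr=0.5):
    """One backpropagation step on a tiny 2-layer, 1-unit-per-layer network."""
    # Forward pass: compute the network's current behavior.
    h = sigmoid(w1 * x)   # hidden activation
    y = sigmoid(w2 * h)   # network output
    # Error signal: current behavior minus desired behavior.
    err = y - target
    # Backward pass: the chain rule distributes that error to each weight.
    dy = err * y * (1.0 - y)        # gradient at the output node
    dw2 = dy * h
    dh = dy * w2                    # error propagated back to the hidden node
    dw1 = dh * h * (1.0 - h) * x
    # Gradient-descent update driven by the propagated error.
    return w1 - lr * dw1, w2 - lr * dw2, err

# Repeatedly nudge the output toward the desired value 1.0 for input 1.0.
w1, w2 = 0.5, 0.5
for _ in range(200):
    w1, w2, err = train_step(w1, w2, 1.0, 1.0)
```

Each iteration shrinks the error, showing how the propagated signal reshapes every weight rather than only the output layer.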