2022
DOI: 10.1007/978-3-031-20059-5_13

Object-Centric Unsupervised Image Captioning

Cited by 7 publications (10 citation statements)
References 32 publications
“…Subsequently, Laina, Rupprecht, and Navab (2019) propose a shared multi-modal space constructed through visual concepts to align images and text. Meng et al. (2022) suggest harvesting objects corresponding to given sentences instead of finding candidate images. Nonetheless, these approaches depend heavily on object detectors, overlooking object attributes and relationships, constrained by detector generalization.…”
Section: Unsupervised Image Captioning
confidence: 99%
“…To measure the quality of generated captions, we follow prior studies (Meng et al. 2022) and employ metrics such as BLEU (Papineni et al. 2002), METEOR (Banerjee and Lavie 2005), ROUGE (Lin 2004), and CIDEr-D (Vedantam, Lawrence Zitnick, and Parikh 2015). Implementation Details.…”
Section: Settings
confidence: 99%
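The metrics named in this quote are the standard COCO caption scores. For readers unfamiliar with the tooling, here is a minimal sketch of how BLEU, METEOR, ROUGE-L, and CIDEr are typically computed with the pycocoevalcap package; the helper function, image ids, and toy captions are illustrative assumptions, not details taken from the cited papers.

```python
# Minimal sketch of COCO-style caption evaluation, assuming the pycocoevalcap
# package is installed; helper name, ids, and toy captions are illustrative.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor   # requires a Java runtime
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider


def evaluate_captions(references, hypotheses):
    """references/hypotheses: dict mapping image id -> list of caption strings.

    Real evaluations usually run PTBTokenizer over both dicts first; plain
    whitespace-separated strings are enough for this sketch.
    """
    scorers = [
        (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
        (Meteor(), "METEOR"),
        (Rouge(), "ROUGE-L"),
        (Cider(), "CIDEr"),
    ]
    results = {}
    for scorer, names in scorers:
        score, _ = scorer.compute_score(references, hypotheses)
        if isinstance(names, list):   # Bleu returns one score per n-gram order
            results.update(dict(zip(names, score)))
        else:
            results[names] = score
    return results


# Toy usage; a real run would use the COCO test-split references.
refs = {"img1": ["a dog runs on the grass", "a brown dog running outside"]}
hyps = {"img1": ["a dog running on grass"]}
print(evaluate_captions(refs, hyps))
```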
“…Most of the aforementioned works, including [3], [5], [17], [20], [21], [22], [31], [40], [49], exploit large auxiliary supervised datasets such as class labels or scene graphs. To the best of our knowledge, we are the first to study how to handle unpaired image and caption data for image captioning without any auxiliary information, by leveraging semi-supervised image-caption data only.…”
Section: Related Work
confidence: 99%
“…Unpaired captioning methods use independent image sources and text corpus for training. They mostly conduct adversarial training or detect objects to establish the connection between images and texts (Ben et al. 2021; Meng et al. 2022; Zhu et al. 2023; Yu et al. 2023). Without training, ZeroCap (Tewel et al. 2022) proposes to integrate CLIP and a language model, where CLIP assumes the role of guiding the language model toward a specific visual direction.…”
Section: Related Work
confidence: 99%
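ZeroCap's published method steers GPT-2 by back-propagating a CLIP loss into the model's cached context. The sketch below is not that method; it is a deliberately simplified illustration of the same idea of CLIP "guiding the language model toward a specific visual direction": candidate next tokens proposed by a language model are re-ranked by CLIP image-text similarity. The Hugging Face checkpoints are the public ones; the function name, prompt, and score weighting are illustrative assumptions.

```python
# Simplified CLIP-guided decoding sketch (candidate re-ranking), NOT ZeroCap's
# gradient-based context update; checkpoints public, hyperparameters illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
lm_tok = GPT2Tokenizer.from_pretrained("gpt2")


@torch.no_grad()
def clip_guided_caption(image: Image.Image, prompt="Image of a",
                        steps=10, top_k=20, alpha=0.7):
    # Embed the image once with CLIP and L2-normalize.
    img = clip.get_image_features(**clip_proc(images=image, return_tensors="pt"))
    img = img / img.norm(dim=-1, keepdim=True)
    ids = lm_tok(prompt, return_tensors="pt").input_ids
    for _ in range(steps):
        # Top-k next-token proposals from the language model.
        log_probs, cand = lm(ids).logits[0, -1].log_softmax(-1).topk(top_k)
        texts = [lm_tok.decode(torch.cat([ids[0], c.view(1)])) for c in cand]
        # Score every candidate continuation against the image with CLIP.
        txt_in = clip_proc(text=texts, return_tensors="pt",
                           padding=True, truncation=True)
        txt = clip.get_text_features(**txt_in)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        clip_sim = (txt @ img.T).squeeze(-1)
        # Illustrative mixing of the two scores; the scales are not calibrated.
        best = (alpha * clip_sim + (1 - alpha) * log_probs).argmax()
        ids = torch.cat([ids, cand[best].view(1, 1)], dim=1)
    return lm_tok.decode(ids[0], skip_special_tokens=True)


# Usage: clip_guided_caption(Image.open("photo.jpg").convert("RGB"))
```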