2022
DOI: 10.1007/978-3-031-20059-5_13

Object-Centric Unsupervised Image Captioning

Cited by 7 publications (10 citation statements)
References 32 publications
“…Subsequently, Laina, Rupprecht, and Navab (2019) propose a shared multi-modal space constructed through visual concepts to align images and text. Meng et al. (2022) suggest harvesting objects corresponding to given sentences instead of finding candidate images. Nonetheless, these approaches depend heavily on object detectors, overlooking object attributes and relationships, constrained by detector generalization.…”
Section: Unsupervised Image Captioning
confidence: 99%
“…To measure the quality of generated captions, we follow prior studies (Meng et al. 2022) and employ metrics such as BLEU (Papineni et al. 2002), METEOR (Banerjee and Lavie 2005), ROUGE (Lin 2004), and CIDEr-D (Vedantam, Lawrence Zitnick, and Parikh 2015). Implementation Details.…”
Section: Settings
confidence: 99%
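The metrics named in this quote are the standard COCO caption scores. For readers unfamiliar with the tooling, here is a minimal sketch of how BLEU, METEOR, ROUGE-L, and CIDEr are typically computed with the pycocoevalcap package; the helper function, image ids, and toy captions are illustrative assumptions, not details taken from the cited papers.

```python
# Minimal sketch of COCO-style caption evaluation, assuming the pycocoevalcap
# package is installed; helper name, ids, and toy captions are illustrative.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor   # requires a Java runtime
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider


def evaluate_captions(references, hypotheses):
    """references/hypotheses: dict mapping image id -> list of caption strings.

    Real evaluations usually run PTBTokenizer over both dicts first; plain
    whitespace-separated strings are enough for this sketch.
    """
    scorers = [
        (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
        (Meteor(), "METEOR"),
        (Rouge(), "ROUGE-L"),
        (Cider(), "CIDEr"),
    ]
    results = {}
    for scorer, names in scorers:
        score, _ = scorer.compute_score(references, hypotheses)
        if isinstance(names, list):   # Bleu returns one score per n-gram order
            results.update(dict(zip(names, score)))
        else:
            results[names] = score
    return results


# Toy usage; a real run would use the COCO test-split references.
refs = {"img1": ["a dog runs on the grass", "a brown dog running outside"]}
hyps = {"img1": ["a dog running on grass"]}
print(evaluate_captions(refs, hyps))
```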
“…Most of the aforementioned works, including [3], [5], [17], [20], [21], [22], [31], [40], [49], exploit large auxiliary supervised datasets such as class labels or scene graphs. To the best of our knowledge, we are the first to study how to handle unpaired image and caption data for image captioning without any auxiliary information, by leveraging semi-supervised image-caption data only.…”
Section: Related Work
confidence: 99%
“…Unpaired captioning methods use independent image sources and text corpus for training. They mostly conduct adversarial training or detect objects to establish the connection between images and texts (Ben et al. 2021; Meng et al. 2022; Zhu et al. 2023; Yu et al. 2023). Without training, ZeroCap (Tewel et al. 2022) proposes to integrate CLIP and a language model, where CLIP assumes the role of guiding the language model toward a specific visual direction.…”
Section: Related Work
confidence: 99%
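ZeroCap's published method steers GPT-2 by back-propagating a CLIP loss into the model's cached context. The sketch below is not that method; it is a deliberately simplified illustration of the same idea of CLIP "guiding the language model toward a specific visual direction": candidate next tokens proposed by a language model are re-ranked by CLIP image-text similarity. The Hugging Face checkpoints are the public ones; the function name, prompt, and score weighting are illustrative assumptions.

```python
# Simplified CLIP-guided decoding sketch (candidate re-ranking), NOT ZeroCap's
# gradient-based context update; checkpoints public, hyperparameters illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
lm_tok = GPT2Tokenizer.from_pretrained("gpt2")


@torch.no_grad()
def clip_guided_caption(image: Image.Image, prompt="Image of a",
                        steps=10, top_k=20, alpha=0.7):
    # Embed the image once with CLIP and L2-normalize.
    img = clip.get_image_features(**clip_proc(images=image, return_tensors="pt"))
    img = img / img.norm(dim=-1, keepdim=True)
    ids = lm_tok(prompt, return_tensors="pt").input_ids
    for _ in range(steps):
        # Top-k next-token proposals from the language model.
        log_probs, cand = lm(ids).logits[0, -1].log_softmax(-1).topk(top_k)
        texts = [lm_tok.decode(torch.cat([ids[0], c.view(1)])) for c in cand]
        # Score every candidate continuation against the image with CLIP.
        txt_in = clip_proc(text=texts, return_tensors="pt",
                           padding=True, truncation=True)
        txt = clip.get_text_features(**txt_in)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        clip_sim = (txt @ img.T).squeeze(-1)
        # Illustrative mixing of the two scores; the scales are not calibrated.
        best = (alpha * clip_sim + (1 - alpha) * log_probs).argmax()
        ids = torch.cat([ids, cand[best].view(1, 1)], dim=1)
    return lm_tok.decode(ids[0], skip_special_tokens=True)


# Usage: clip_guided_caption(Image.open("photo.jpg").convert("RGB"))
```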