2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00751

Towards Unsupervised Image Captioning With Shared Multimodal Embeddings

Abstract: Understanding images without explicit supervision has become an important problem in computer vision. In this paper, we address image captioning by generating language descriptions of scenes without learning from annotated pairs of images and their captions. The core component of our approach is a shared latent space that is structured by visual concepts. In this space, the two modalities should be indistinguishable. A language model is first trained to encode sentences into semantically structured embeddings.…
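To make the abstract's idea of a shared, modality-indistinguishable latent space more concrete, here is a minimal sketch in PyTorch. The encoder sizes, the GRU sentence encoder, and the adversarial "modality confusion" objective are assumptions chosen for illustration, not the authors' exact architecture or training procedure.

```python
# Illustrative sketch of a shared multimodal embedding space (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 256  # dimensionality of the shared latent space (assumed)

class SentenceEncoder(nn.Module):
    """Encodes a tokenized sentence into the shared space."""
    def __init__(self, vocab_size, emb_dim=EMB_DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, emb_dim, batch_first=True)

    def forward(self, tokens):               # tokens: (batch, seq_len) int64
        _, h = self.rnn(self.embed(tokens))  # h: (1, batch, emb_dim)
        return F.normalize(h.squeeze(0), dim=-1)

class ImageEncoder(nn.Module):
    """Projects precomputed CNN image features into the same space."""
    def __init__(self, feat_dim=2048, emb_dim=EMB_DIM):
        super().__init__()
        self.proj = nn.Linear(feat_dim, emb_dim)

    def forward(self, feats):                 # feats: (batch, feat_dim)
        return F.normalize(self.proj(feats), dim=-1)

# A small critic that tries to tell image embeddings from sentence embeddings;
# training the encoders against it pushes the two modalities to look alike
# in the shared space (one possible way to realize "indistinguishability").
discriminator = nn.Sequential(
    nn.Linear(EMB_DIM, 128), nn.ReLU(), nn.Linear(128, 1)
)

def modality_confusion_loss(img_emb, txt_emb):
    """Encoder-side loss: both modalities should be uninformative to the critic."""
    logits_img = discriminator(img_emb)
    logits_txt = discriminator(txt_emb)
    target = torch.full_like(logits_img, 0.5)  # critic left maximally uncertain
    return F.binary_cross_entropy_with_logits(logits_img, target) + \
           F.binary_cross_entropy_with_logits(logits_txt, target)

if __name__ == "__main__":
    txt_enc = SentenceEncoder(vocab_size=10000)
    img_enc = ImageEncoder()
    tokens = torch.randint(0, 10000, (4, 12))   # dummy tokenized captions
    feats = torch.randn(4, 2048)                # dummy CNN image features
    loss = modality_confusion_loss(img_enc(feats), txt_enc(tokens))
    print(loss.item())
```

In a full adversarial setup the critic would also be updated with true modality labels; the sketch only shows the encoder-side objective that structures the shared space.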

Cited by 87 publications (91 citation statements)
References 58 publications (78 reference statements)
“…Similarly, Peri et al. [53] proposed a multimodal framework that encodes both images and captions, using a CNN and an RNN, into an intermediate-level representation and then decodes these multimodal representations into a new caption similar to the input. The authors of [128] presented an unsupervised image captioning framework based on a new alignment method that allows the simultaneous integration of the visual and textual streams through semantic learning of multimodal embeddings of the language and vision domains. Moreover, a multimodal model can also aggregate motion information [174], acoustic information [175], temporal information [176], etc.…”
Section: Image Captioning (mentioning; confidence: 99%)
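The encode-then-decode pattern described in the statement above (CNN image features conditioning an RNN that generates a caption) can be sketched as follows. The layer sizes, the feature-initialized GRU, and the training setup are assumptions for illustration, not the cited papers' exact models.

```python
# Generic CNN-encode / RNN-decode captioning sketch (illustrative assumptions).
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, hid_dim=256):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hid_dim)   # image feature -> initial state
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.rnn = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, img_feats, tokens):
        # img_feats: (batch, feat_dim) from a pretrained CNN; tokens: (batch, seq)
        h0 = torch.tanh(self.init_h(img_feats)).unsqueeze(0)
        outputs, _ = self.rnn(self.embed(tokens), h0)
        return self.out(outputs)                      # (batch, seq, vocab) logits

# Usage: condition on image features, feed the previous tokens, and train with
# cross-entropy against the next-token targets.
decoder = CaptionDecoder(vocab_size=10000)
logits = decoder(torch.randn(2, 2048), torch.randint(0, 10000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 10000])
```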
“…Unsupervised Grounding Learning Prior works have explored learning grounding with weak or no supervision (Rohrbach et al., 2016; Xiao et al., 2017). Closest to this paper is unsupervised image captioning (Feng et al., 2019; Laina et al., 2019; Gu et al., 2019), which conducts image captioning with unpaired images and captions. Similar to this work, the detector tags serve as the anchor points for image captioning.…”
Section: Unsupervised Multi-lingual Language Model (mentioning; confidence: 99%)
“…Collecting such labeled data takes long hours of labor and is expensive, so there is strong interest in developing algorithms that do not need many annotated examples. Some studies embed the visual features and text information into a shared space and design unsupervised learning algorithms to reduce the requirement for annotated data (Gu et al., 2018; Gu et al., 2019; Laina et al., 2019; Feng et al., 2019). However, the performance of such algorithms is poor because they do not use pairs of labeled examples at all.…”
Section: Introduction (mentioning; confidence: 99%)