2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019
DOI: 10.1109/iccv.2019.01042
|View full text |Cite
|
Sign up to set email alerts
|

Unpaired Image Captioning via Scene Graph Alignments

Abstract: Most of current image captioning models heavily rely on paired image-caption datasets. However, getting large scale image-caption paired data is labor-intensive and time-consuming. In this paper, we present a scene graphbased approach for unpaired image captioning. Our framework comprises an image scene graph generator, a sentence scene graph generator, a scene graph encoder, and a sentence decoder. Specifically, we first train the scene graph encoder and the sentence decoder on the text modality. To align the… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
2
2
1

Citation Types

0
115
0

Year Published

2019
2019
2021
2021

Publication Types

Select...
5
3
2

Relationship

1
9

Authors

Journals

citations
Cited by 139 publications
(115 citation statements)
references
References 35 publications
0
115
0
Order By: Relevance
“…Scene segmentation (or scene parsing, semantic segmentation) is one of the fundamental problems in computer vision and has drawn lots of attentions. Recently, thanks to the great success of Convolutional Neural Networks (CNNs) in computer vision [42,68,71,52,25,72,27,80,26], lots of CNNs based segmentation works have been proposed and have achieved great progress [29,22,81,83,84,70,60]. For example, Long et al [54] introduce the fully convolutional networks (FCN) in which the fully connected layers in standard CNNs are transformed to convolutional layers.…”
Section: Related Work 21 Scene Segmentationmentioning
confidence: 99%
“…Scene segmentation (or scene parsing, semantic segmentation) is one of the fundamental problems in computer vision and has drawn lots of attentions. Recently, thanks to the great success of Convolutional Neural Networks (CNNs) in computer vision [42,68,71,52,25,72,27,80,26], lots of CNNs based segmentation works have been proposed and have achieved great progress [29,22,81,83,84,70,60]. For example, Long et al [54] introduce the fully convolutional networks (FCN) in which the fully connected layers in standard CNNs are transformed to convolutional layers.…”
Section: Related Work 21 Scene Segmentationmentioning
confidence: 99%
“…Generally, a convolutional neural network (CNN) is tasked with feature extraction, while a recurrent neural network (RNN) relies upon to translate the training annotations with the image features [115]. Aside from determining and extracting salient and intricate details in an image, it is equally important to extract the interactions and semantic relationship between such objects and how to illustrate them in the right manner using appropriate tenses and sentence structures [116]. Also, because the training 6 Complexity labels which are texts are different from the features obtained from the images, language model techniques are required to analyze the form, meaning, and context of a sequence of words.…”
Section: Image Captioningmentioning
confidence: 99%
“…Wang et al (2019b) show that image scene graphs extracted using a trained model can match the captioning performance of an oracle with access to ground-truth graphs. Aligning text-and imagebased scene graphs has also been used to generate image captions without paired data (Gu et al, 2019).…”
Section: Related Workmentioning
confidence: 99%