2018
DOI: 10.1007/978-3-030-01264-9_42

Exploring Visual Relationship for Image Captioning

Abstract: It is always well believed that modeling relationships between objects would be helpful for representing and eventually describing an image. Nevertheless, there has not been evidence in support of the idea on image description generation. In this paper, we introduce a new design to explore the connections between objects for image captioning under the umbrella of attention-based encoder-decoder framework. Specifically, we present Graph Convolutional Networks plus Long Short-Term Memory (dubbed as GCN-LSTM) arc…
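The abstract's core idea is an attention-based encoder-decoder in which detected object regions are connected by predicted relations, a Graph Convolutional Network enriches each region feature with its related regions, and the relation-aware features are then attended by an LSTM decoder. Below is a minimal sketch of such a graph-convolution step, assuming PyTorch tensors, Faster R-CNN region features, and a precomputed 0/1 relation matrix; the layer names and dimensions are illustrative and do not reproduce the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionGCN(nn.Module):
    """One graph-convolution step over detected object regions (sketch).

    feats: (batch, N, dim) region features, e.g. from Faster R-CNN.
    adj:   (batch, N, N) float 0/1 matrix marking predicted relations.
    """
    def __init__(self, dim=2048):
        super().__init__()
        self.w_self = nn.Linear(dim, dim)   # transform of the region itself
        self.w_neigh = nn.Linear(dim, dim)  # transform of related regions

    def forward(self, feats, adj):
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)  # avoid divide-by-zero
        neigh = torch.bmm(adj, feats) / deg               # mean over related regions
        # Relation-aware region features, later attended by the LSTM decoder.
        return F.relu(self.w_self(feats) + self.w_neigh(neigh))
```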

Cited by 744 publications (560 citation statements)
References 36 publications
“…We report the performance on the offline test split of our model as well as the compared models in Table 1. The models include: LSTM [37], which encodes the image using CNN and decodes it using LSTM; SCST [31], which employs a modified visual attention and is the first to use SCST to directly optimize the evaluation metrics; Up-Down [2], which employs a two-LSTM layer model with bottom-up features extracted from Faster-RCNN; RFNet [20], which fuses encoded features from multiple CNN networks; GCN-LSTM [49], which predicts visual relationships between every two entities in the image and encodes the relationship information into feature vectors; and SGAE [44], which introduces auto-encoding scene graphs into its model.…”
Section: Quantitative Analysis
confidence: 99%
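The GCN-LSTM entry above hinges on predicting a relationship for every pair of detected regions and folding that information into the region features. A hedged sketch of the pairwise prediction step follows; the classifier head and the number of relation classes are hypothetical stand-ins for the paper's semantic and spatial relation detectors.

```python
import torch
import torch.nn as nn

class PairwiseRelationHead(nn.Module):
    """Scores a relation class for every ordered pair of regions (sketch)."""
    def __init__(self, dim=2048, num_relations=16):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, 512), nn.ReLU(),
            nn.Linear(512, num_relations),
        )

    def forward(self, feats):
        # feats: (batch, N, dim) region features from the object detector.
        b, n, d = feats.shape
        subj = feats.unsqueeze(2).expand(b, n, n, d)  # subject of each pair
        obj = feats.unsqueeze(1).expand(b, n, n, d)   # object of each pair
        pairs = torch.cat([subj, obj], dim=-1)        # (b, n, n, 2 * dim)
        return self.classifier(pairs)                 # (b, n, n, num_relations)
```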
“…Since we design our HIP architecture to be a feature refiner or bank that outputs rich, multi-level representations of the image, it is feasible to plug HIP into any neural captioning model. We next discuss how to integrate hierarchy parsing into a general attention-based LSTM decoder in Up-Down [3] or a specific relation-augmented decoder in GCN-LSTM [34]. Please also note that our HIP generalizes flexibly to other vision tasks, e.g., recognition.…”
Section: Image Captioning With Hierarchy Parsing
confidence: 99%
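The HIP excerpt treats hierarchy parsing as a feature bank whose multi-level outputs any attention-based decoder can consume. The sketch below shows that consumption point as Up-Down-style additive attention over a stacked bank of instance-, region-, and image-level features; the module name, dimensions, and stacking scheme are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class BankAttention(nn.Module):
    """Additive attention over a multi-level feature bank (illustrative)."""
    def __init__(self, feat_dim=2048, hid_dim=1024, att_dim=512):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, att_dim)
        self.w_hid = nn.Linear(hid_dim, att_dim)
        self.w_score = nn.Linear(att_dim, 1)

    def forward(self, bank, hidden):
        # bank:   (batch, M, feat_dim) multi-level features stacked along M.
        # hidden: (batch, hid_dim) current LSTM decoder state.
        scores = self.w_score(torch.tanh(
            self.w_feat(bank) + self.w_hid(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)  # attention weights over the bank
        return (alpha * bank).sum(dim=1)      # attended context vector
```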
“…Object relations characterize the interactions or geometric positions between objects. In the literature, there is strong evidence that object relations support various vision tasks, e.g., recognition [48], object detection [17], cross-domain detection [2], and image captioning [52]. One representative work that employs object relations is [17], for object detection in images.…”
Section: Introduction
confidence: 99%
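The introduction excerpt separates semantic interactions from geometric positions between objects. For the geometric side, a common choice is a translation- and scale-normalized encoding of a box pair, sketched below; this is one standard formulation, not necessarily the exact features used in [17] or [52].

```python
import math

def pairwise_geometry(box_a, box_b):
    """Relative geometry of two boxes given as (x, y, w, h) tuples (sketch)."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    return (
        math.log(abs(xb - xa) / wa + 1e-6),  # horizontal offset, scale-normalized
        math.log(abs(yb - ya) / ha + 1e-6),  # vertical offset, scale-normalized
        math.log(wb / wa),                   # relative width
        math.log(hb / ha),                   # relative height
    )
```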