2018
DOI: 10.1007/978-3-030-01264-9_42

Exploring Visual Relationship for Image Captioning

Abstract: It is always well believed that modeling relationships between objects would be helpful for representing and eventually describing an image. Nevertheless, there has not been evidence in support of the idea on image description generation. In this paper, we introduce a new design to explore the connections between objects for image captioning under the umbrella of attention-based encoder-decoder framework. Specifically, we present Graph Convolutional Networks plus Long Short-Term Memory (dubbed as GCN-LSTM) arc…
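The abstract's core idea is an attention-based encoder-decoder in which detected object regions are connected by predicted relations, a Graph Convolutional Network enriches each region feature with its related regions, and the relation-aware features are then attended by an LSTM decoder. Below is a minimal sketch of such a graph-convolution step, assuming PyTorch tensors, Faster R-CNN region features, and a precomputed 0/1 relation matrix; the layer names and dimensions are illustrative and do not reproduce the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionGCN(nn.Module):
    """One graph-convolution step over detected object regions (sketch).

    feats: (batch, N, dim) region features, e.g. from Faster R-CNN.
    adj:   (batch, N, N) float 0/1 matrix marking predicted relations.
    """
    def __init__(self, dim=2048):
        super().__init__()
        self.w_self = nn.Linear(dim, dim)   # transform of the region itself
        self.w_neigh = nn.Linear(dim, dim)  # transform of related regions

    def forward(self, feats, adj):
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)  # avoid divide-by-zero
        neigh = torch.bmm(adj, feats) / deg               # mean over related regions
        # Relation-aware region features, later attended by the LSTM decoder.
        return F.relu(self.w_self(feats) + self.w_neigh(neigh))
```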

Cited by 744 publications (560 citation statements)
References 36 publications
“…We report the performance on the offline test split of our model as well as the compared models in Table 1. The models include: LSTM [37], which encodes the image using CNN and decodes it using LSTM; SCST [31], which employs a modified visual attention and is the first to use SCST to directly optimize the evaluation metrics; Up-Down [2], which employs a two-LSTM layer model with bottom-up features extracted from Faster-RCNN; RFNet [20], which fuses encoded features from multiple CNN networks; GCN-LSTM [49], which predicts visual relationships between every two entities in the image and encodes the relationship information into feature vectors; and SGAE [44], which introduces auto-encoding scene graphs into its model.…”
Section: Quantitative Analysis
confidence: 99%
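The GCN-LSTM entry above hinges on predicting a relationship for every pair of detected regions and folding that information into the region features. A hedged sketch of the pairwise prediction step follows; the classifier head and the number of relation classes are hypothetical stand-ins for the paper's semantic and spatial relation detectors.

```python
import torch
import torch.nn as nn

class PairwiseRelationHead(nn.Module):
    """Scores a relation class for every ordered pair of regions (sketch)."""
    def __init__(self, dim=2048, num_relations=16):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, 512), nn.ReLU(),
            nn.Linear(512, num_relations),
        )

    def forward(self, feats):
        # feats: (batch, N, dim) region features from the object detector.
        b, n, d = feats.shape
        subj = feats.unsqueeze(2).expand(b, n, n, d)  # subject of each pair
        obj = feats.unsqueeze(1).expand(b, n, n, d)   # object of each pair
        pairs = torch.cat([subj, obj], dim=-1)        # (b, n, n, 2 * dim)
        return self.classifier(pairs)                 # (b, n, n, num_relations)
```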
“…Since we design our HIP architecture to be a feature refiner or bank that outputs rich, multi-level representations of the image, it is feasible to plug HIP into any neural captioning model. We next discuss how to integrate hierarchy parsing into a general attention-based LSTM decoder in Up-Down [3] or a specific relation-augmented decoder in GCN-LSTM [34]. Please also note that our HIP generalizes flexibly to other vision tasks, e.g., recognition.…”
Section: Image Captioning With Hierarchy Parsing
confidence: 99%
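The HIP excerpt treats hierarchy parsing as a feature bank whose multi-level outputs any attention-based decoder can consume. The sketch below shows that consumption point as Up-Down-style additive attention over a stacked bank of instance-, region-, and image-level features; the module name, dimensions, and stacking scheme are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class BankAttention(nn.Module):
    """Additive attention over a multi-level feature bank (illustrative)."""
    def __init__(self, feat_dim=2048, hid_dim=1024, att_dim=512):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, att_dim)
        self.w_hid = nn.Linear(hid_dim, att_dim)
        self.w_score = nn.Linear(att_dim, 1)

    def forward(self, bank, hidden):
        # bank:   (batch, M, feat_dim) multi-level features stacked along M.
        # hidden: (batch, hid_dim) current LSTM decoder state.
        scores = self.w_score(torch.tanh(
            self.w_feat(bank) + self.w_hid(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)  # attention weights over the bank
        return (alpha * bank).sum(dim=1)      # attended context vector
```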
“…Object relations characterize the interactions or geometric positions between objects. In the literature, there is strong evidence that object relations support various vision tasks, e.g., recognition [48], object detection [17], cross-domain detection [2], and image captioning [52]. One representative work that employs object relations is [17], for object detection in images.…”
Section: Introduction
confidence: 99%
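The introduction excerpt separates semantic interactions from geometric positions between objects. For the geometric side, a common choice is a translation- and scale-normalized encoding of a box pair, sketched below; this is one standard formulation, not necessarily the exact features used in [17] or [52].

```python
import math

def pairwise_geometry(box_a, box_b):
    """Relative geometry of two boxes given as (x, y, w, h) tuples (sketch)."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    return (
        math.log(abs(xb - xa) / wa + 1e-6),  # horizontal offset, scale-normalized
        math.log(abs(yb - ya) / ha + 1e-6),  # vertical offset, scale-normalized
        math.log(wb / wa),                   # relative width
        math.log(hb / ha),                   # relative height
    )
```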