Densely Supervised Hierarchical Policy-Value Network for Image Paragraph Generation

Wu, Siying; Zha, Zheng-Jun; Wang, Zilei; Li, Houqiang; Wu, Feng

doi:10.24963/ijcai.2019/137

Cited by 13 publications

(15 citation statements)

References 8 publications


“…We compare our HSGED(SLL) with several state-of-the-art models: Regions-Hierarchical [13], RTT-GAN [17], DCPG [5], HCAVP [46], DHPV [36], CAE-LSTM [34], TDPG [23] and CRL [22]. Among these methods, RTT-GAN, DCPG, Regions-Hierarchical, DHPV, Hierarchical CAVP and CAE-LSTM use HRNNs with different technique details.…”

Section: Comparing Methods (mentioning)

confidence: 99%

“…Researchers also propose advanced techniques to refine the prototypical HRNN, e.g., generative models like GAN [17] or VAE [5] for stronger consistency; the trigram repetition penalty based sampling method for diversity [23]. Besides, dense sentence-level rewards [36] and curiosity-driven reinforcement learning [22] are used for more robust training, all of which could also be applied in our proposed framework, HSGED. However, most of them are built without enough hierarchical constraints, so the qualities of the generated paragraphs are unsatisfactory.…”

Section: Related Work (mentioning)

confidence: 99%

“…In this way, a more informative paragraph can be generated, as in Figure 1 (d), the generated sentences cover more topics than those generated by the flat RNN in Figure 1 (c). However, the multi-level RNNs in HRNN are built without any hierarchical constraints [5,13,17,36,46], which results in two problems. 1) The topics are not coherent and distinctive since they are formed by randomly sampling some object sub-sets without any global constraint.…”

Section: Introduction (mentioning)

confidence: 99%


Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning

Gao, Zhang, et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia


When we humans tell a long paragraph about an image, we usually first implicitly compose a mental "script" and then comply with it to generate the paragraph. Inspired by this, we render the modern encoder-decoder based image paragraph captioning model such ability by proposing Hierarchical Scene Graph Encoder-Decoder (HSGED) for generating coherent and distinctive paragraphs. In particular, we use the image scene graph as the "script" to incorporate rich semantic knowledge and, more importantly, the hierarchical constraints into the model. Specifically, we design a sentence scene graph RNN (SSG-RNN) to generate sub-graph level topics, which constrain the word scene graph RNN (WSG-RNN) to generate the corresponding sentences. We propose irredundant attention in SSG-RNN to improve the possibility of abstracting topics from rarely described sub-graphs and inheriting attention in WSG-RNN to generate more grounded sentences with the abstracted topics, both of which give rise to more distinctive paragraphs. An efficient sentence-level loss is also proposed for encouraging the sequence of generated sentences to be similar to that of the ground-truth paragraphs. We validate HSGED on Stanford image paragraph dataset and show that it not only achieves a new state-of-the-art 36.02 CIDEr-D, but also generates more coherent and distinctive paragraphs under various metrics.

CCS Concepts: Computing methodologies → Computer vision tasks.


“…Our GeoRic dataset consists of 29,038 images from the Geograph project website, with captions and location coordinates. We selected captions that are exactly one sentence long (multi-sentence caption generation, although a promising research direction (Mao et al., 2018; Wu et al., 2019), is not addressed in this work) and include at least one spatial expression, such as "near", "north of", "across", etc. (in order to ensure that the captions contain enough geographic referencing).…”

Section: The GeoRic Dataset (mentioning)

confidence: 99%

Geo-Aware Image Caption Generation

Nikiforova, Deoskar, Paperno, et al. 2020

Proceedings of the 28th International Conference on Computational Linguistics


Standard image caption generation systems produce generic descriptions of images and do not utilize any contextual information or world knowledge. In particular, they are unable to generate captions that contain references to the geographic context of an image, for example, the location where a photograph is taken or relevant geographic objects around an image location. In this paper, we develop a geo-aware image caption generation system, which incorporates geographic contextual information into a standard image captioning pipeline. We propose a way to build an image-specific representation of the geographic context and adapt the caption generation network to produce appropriate geographic names in the image descriptions. We evaluate our system on a novel captioning dataset that contains contextualized captions and geographic metadata and achieve substantial improvements in BLEU, ROUGE, METEOR and CIDEr scores. We also introduce a new metric to assess generated geographic references directly and empirically demonstrate our system's ability to produce captions with relevant and factually accurate geographic referencing.


“…Besides, dense sentence-level rewards [49] and curiosity-driven reinforcement learning [50] are used for more robust training, all of which could also be applied in our proposed framework, HSGED. However, most of them are built without enough 2.3.…”

Section: Image Paragraph Captioning (mentioning)

confidence: 99%

Incorporating additional knowledge into image captioners

Yang
