Informative Image Captioning with External Sources of Information

Zhao, Sanqiang; Sharma, Piyush; Levinboim, Tomer; Soricut, Radu

doi:10.18653/v1/p19-1650

Cited by 26 publications

(19 citation statements)

References 24 publications

“…Following (Zhao et al, 2019), we obtain the subjects' ratings for fidelity (the first caption is superior in terms of making less mistakes?), informativeness (the first caption provides more informative and detailed description?…”

Section: Quantitative Analysis (mentioning)

confidence: 99%

Improving Image Captioning with Better Use of Caption

Shi, Zhou, Qiu et al. 2020

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Image captioning is a multimodal problem that has drawn extensive attention in both the natural language processing and computer vision community. In this paper, we present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation. Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning. The representation is then enhanced with neighbouring and contextual nodes with their textual and visual features. During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences. We perform extensive experiments on the MSCOCO dataset, showing that the proposed framework significantly outperforms the baselines, resulting in the state-of-the-art performance under a wide range of evaluation metrics. The code of our paper has been made publicly available.

“…To overcome the limitations imposed by the automatic metrics, several studies evaluate their models using human judgments (Sharma et al 2018; Zhao et al 2019; Dognin et al 2019; Forbes et al 2019). However, none of them utilizes the human-rated captions in the model evaluations.…”

Section: Related Work (mentioning)

confidence: 99%

“…Image captioning is the task of automatically generating fluent natural language descriptions for an input image. However, measuring the quality of generated captions in an automatic manner is a challenging and yet-unsolved task; therefore, human evaluations are often required to assess the complex semantic relationships between a visual scene and a generated caption (Sharma et al 2018; Cui et al 2018; Zhao et al 2019). As a result, there is a mismatch between the training objective of the captioning models and their final evaluation criteria.…”

Section: Introduction (mentioning)

confidence: 99%

“…As a result of the need to understand the performance of the current models, human evaluation studies for measuring caption quality are frequently reported in the literature (Sharma et al 2018; Forbes et al 2019; Dognin et al 2019; Zhao et al 2019). In addition to an aggregate model performance, such human evaluation studies also produce a valuable by-product: a dataset of model-generated image captions with human annotated quality labels, as shown in Figure 1b. We argue that such a by-product, henceforth called a caption ratings dataset, can be successfully used to improve the quality of image captioning models, for several reasons.…”

Section: Introduction (mentioning)

confidence: 99%

Reinforcing an Image Caption Generator Using Off-Line Human Feedback

Seo, Sharma, Levinboim et al. 2020

AAAI

Self Cite

Human ratings are currently the most accurate way to assess the quality of an image captioning model, yet most often the only used outcome of an expensive human rating evaluation is a few overall statistics over the evaluation dataset. In this paper, we show that the signal from instance-level human caption ratings can be leveraged to improve captioning models, even when the amount of caption ratings is several orders of magnitude less than the caption training data. We employ a policy gradient method to maximize the human ratings as rewards in an off-policy reinforcement learning setting, where policy gradients are estimated by samples from a distribution that focuses on the captions in a caption ratings dataset. Our empirical evidence indicates that the proposed method learns to generalize the human raters' judgments to a previously unseen set of images, as judged by a different set of human judges, and additionally on a different, multi-dimensional side-by-side human evaluation procedure.

“…The capability of the machines to comprehend and complete users' goals has empowered researchers to build advanced dialogue systems. With the progress in visual question answering [1,2] and image captioning [3,4], the use of different modalities in dialogue agents has shown remarkable performance bringing the different areas of computer vision (CV) and natural language processing (NLP) together. Hence, multimodal dialogue system bridges the gap between vision and language, ensuring interdisciplinary research.…”

Section: Introduction (mentioning)

confidence: 99%

More to diverse: Generating diversified responses in a task oriented multimodal dialog system

2020

Multimodal dialogue system, due to its many-fold applications, has gained much attention from researchers and developers in recent times. With the release of large-scale multimodal dialog dataset (Saha et al. 2018) on the fashion domain, it has been possible to investigate the dialogue systems having both textual and visual modalities. Response generation is an essential aspect of every dialogue system, and making the responses diverse is an important problem. For any goal-oriented conversational agent, the system’s responses must be informative, diverse and polite, which may lead to better user experiences. In this paper, we propose an end-to-end neural framework for generating varied responses in a multimodal dialogue setup capturing information from both the text and image. Multimodal encoder with co-attention between the text and image is used for focusing on the different modalities to obtain better contextual information. For effective information sharing across the modalities, we combine the information of text and images using the BLOCK fusion technique that helps in learning an improved multimodal representation. We employ stochastic beam search with Gumbel Top K-tricks to achieve diversified responses while preserving the content and politeness in the responses. Experimental results show that our proposed approach performs significantly better compared to the existing and baseline methods in terms of distinct metrics, and thereby generates more diverse responses that are informative, interesting and polite without any loss of information. Empirical evaluation also reveals that images, while used along with the text, improve the efficiency of the model in generating diversified responses.
