2020
DOI: 10.1007/s10994-020-05919-y
Boost image captioning with knowledge reasoning

Abstract: Automatically generating a human-like description for a given image is a promising research direction in artificial intelligence, which has attracted a great deal of attention recently. Most existing attention methods explore the mapping relationships between words in the sentence and regions in the image; such an unpredictable matching manner sometimes causes inharmonious alignments that may reduce the quality of generated captions. In this paper, we make our efforts to reason about more accurate and meaningful captions. We fir…
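The word-region alignment the abstract refers to is usually realized as soft attention: each word embedding is scored against detected region features, and a softmax over those scores decides which regions ground the word. A minimal sketch of that mechanism, with all names and dimensions illustrative rather than taken from the paper:

```python
import numpy as np

def region_attention(word_vec, region_feats):
    """Soft attention of one word over image regions (illustrative sketch).

    word_vec:     (d,) embedding of the word being generated
    region_feats: (R, d) features of R detected image regions
    Returns the attended context vector and the alignment weights.
    """
    scores = region_feats @ word_vec                  # (R,) similarity per region
    scores = scores - scores.max()                    # numerical stability for softmax
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over regions
    context = weights @ region_feats                  # convex combination of regions
    return context, weights

rng = np.random.default_rng(0)
word = rng.normal(size=4)            # hypothetical word embedding
regions = rng.normal(size=(5, 4))    # five hypothetical region features
context, alpha = region_attention(word, regions)
```

When the highest-weighted region does not actually depict the word, the alignment is "inharmonious" in the abstract's sense; the paper's contribution is to constrain such matches with knowledge reasoning.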

Cited by 26 publications (7 citation statements)
References 39 publications
“…The caption will read as though she might be looking for a roadside direction sign board to take down an address or waiting for a bus with her bags by combining external knowledge (Knowledge Graph). 193 To get the caption of an image by Knowledge Graph, we must go with the following steps:…”
Section: Knowledge Graph-based Methods for Image Captioning (mentioning, confidence: 99%)
“…Importantly, the anchor is separate from the image itself and can be non-visual. In previous research, the connection to external knowledge was often established through object detection or image classification (Mogadala et al, 2018;Zhou et al, 2019;Huang et al, 2020;Bai et al, 2021), leaving unexplored the potential benefits of utilizing the associated nonvisual data. For example, certain elements of image metadata, such as the coordinates of its location or the date and time of its creation, can be used as an anchor, since they provide information about the circumstances in which the image originated and thus can help identify relevant entities and events.…”
Section: Identification of Relevant Knowledge (mentioning, confidence: 99%)
“…Integrating external encyclopedic data into image captioning has not been the focus of much prior research, although the few existing works (Mogadala et al, 2018;Zhou et al, 2019;Huang et al, 2020;Bai et al, 2021) show its potential for improving informativeness and overall quality of the generated captions.…”
Section: Enhancing Caption Generation with Encyclopedic Data (mentioning, confidence: 99%)