2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00537
Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation

Cited by 22 publications (11 citation statements). References 43 publications.
“…While it is well known that BLEU-type metrics are suboptimal in terms of correlation with human judgements (Reiter, 2018; Sulem et al., 2018; Kilickaya et al., 2017), especially if there is only one reference ground-truth caption per image (and not multiple, as in MSCOCO or Flickr30k), they still provide a reliable way to capture the differences between the alternative models (Hu et al., 2020; Tran et al., 2020; Zhao et al., 2021; Bai et al., 2021). Although a direct comparison between these models and the ones developed in this dissertation is not possible due to the differences in the datasets and the task specifics, the metric scores reported in these works are close to ours, with BLEU-4 ranging from 1.71 to 8.8 and CIDEr ranging from 9.1 to 54.47.…”
Section: Quantitative Evaluation (supporting)
confidence: 58%
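To make the single-reference BLEU issue raised above concrete, the following is a minimal sketch of computing BLEU-4 against one reference caption with NLTK. The two captions are invented placeholders, not data from any of the cited papers; smoothing is applied because, with a single short reference, higher-order n-grams frequently have zero overlap and would otherwise collapse the score to zero.

    # Minimal sketch: single-reference BLEU-4 with NLTK.
    # The captions below are invented placeholders for illustration only.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "a portrait of a woman in a dark interior".split()
    hypothesis = "a painting of a woman in an interior".split()

    # Smoothing avoids a hard zero when a higher-order n-gram has no match,
    # which is common when only one reference caption per image is available.
    smooth = SmoothingFunction().method1
    bleu4 = sentence_bleu([reference], hypothesis,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=smooth)
    print(f"BLEU-4: {bleu4:.3f}")

With multiple references (as in MSCOCO or Flickr30k), the first argument would simply contain several tokenized captions, which is why scores on single-reference art datasets are systematically lower and noisier.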
“…Importantly, the anchor is separate from the image itself and can be non-visual. In previous research, the connection to external knowledge was often established through object detection or image classification (Mogadala et al., 2018; Zhou et al., 2019; Huang et al., 2020; Bai et al., 2021), leaving unexplored the potential benefits of utilizing the associated non-visual data. For example, certain elements of image metadata, such as the coordinates of its location or the date and time of its creation, can be used as an anchor, since they provide information about the circumstances in which the image originated and thus can help identify relevant entities and events.…”
Section: Identification of Relevant Knowledge (mentioning)
confidence: 99%
“…In CLIP-Art [7], the contrastive vision-language loss from CLIP [27] is used to fine-tune on the iMet collection [40], leading to improvements on downstream multimodal retrieval and classification tasks for paintings. The authors of [3] present a framework for generating informative painting captions based on masked sentence generation using an LSTM and knowledge retrieval using TF-IDF vectors. They report their experimental results on the SemArt collection [11].…”
Section: Related Work (mentioning)
confidence: 99%
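As a rough illustration of the TF-IDF-style knowledge retrieval mentioned in the statement above, the sketch below ranks external knowledge passages against a painting description by cosine similarity of TF-IDF vectors using scikit-learn. The passages and the query are invented placeholders, and this is a generic sketch of the technique, not the exact pipeline of [3].

    # Sketch: rank knowledge passages against a painting description with TF-IDF.
    # Corpus and query are invented placeholders for illustration only.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    passages = [
        "Impressionism emerged in France in the 1870s.",
        "Oil on canvas was the dominant medium of the period.",
        "The sitter was a patron of the artist.",
    ]
    query = "an impressionist oil painting from France"

    vectorizer = TfidfVectorizer(stop_words="english")
    passage_vecs = vectorizer.fit_transform(passages)   # fit vocabulary on corpus
    query_vec = vectorizer.transform([query])           # embed query in same space

    # Cosine similarity between the query and every passage; highest wins.
    scores = cosine_similarity(query_vec, passage_vecs).ravel()
    best = scores.argmax()
    print(passages[best], round(float(scores[best]), 3))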
“…the images contained have vastly greater style variation than existing datasets [42, 16], which predominantly focus on small subsets of fine art, sometimes further limited to only European or Asian art [3, 27, 53]. Not only is StyleBabel's domain more diverse, but our annotations also differ.…”
Section: Introduction (mentioning)
confidence: 99%