2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00537
Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation

Cited by 22 publications (11 citation statements). References 43 publications.
“…While it is well known that BLEU-type metrics are suboptimal in terms of correlation with human judgements (Reiter, 2018; Sulem et al., 2018; Kilickaya et al., 2017), especially if there is only one reference ground-truth caption per image (and not multiple, as in MSCOCO or Flickr30k), they still provide a reliable way to capture the differences between the alternative models (Hu et al., 2020; Tran et al., 2020; Zhao et al., 2021; Bai et al., 2021). Although a direct comparison between these models and the ones developed in this dissertation is not possible due to the differences in the datasets and the task specifics, the metric scores reported in these works are close to ours, with BLEU-4 ranging from 1.71 to 8.8 and CIDEr ranging from 9.1 to 54.47.…”
Section: Quantitative Evaluation (supporting)
confidence: 58%
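To make the single-reference BLEU issue raised above concrete, the following is a minimal sketch of computing BLEU-4 against one reference caption with NLTK. The two captions are invented placeholders, not data from any of the cited papers; smoothing is applied because, with a single short reference, higher-order n-grams frequently have zero overlap and would otherwise collapse the score to zero.

    # Minimal sketch: single-reference BLEU-4 with NLTK.
    # The captions below are invented placeholders for illustration only.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "a portrait of a woman in a dark interior".split()
    hypothesis = "a painting of a woman in an interior".split()

    # Smoothing avoids a hard zero when a higher-order n-gram has no match,
    # which is common when only one reference caption per image is available.
    smooth = SmoothingFunction().method1
    bleu4 = sentence_bleu([reference], hypothesis,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=smooth)
    print(f"BLEU-4: {bleu4:.3f}")

With multiple references (as in MSCOCO or Flickr30k), the first argument would simply contain several tokenized captions, which is why scores on single-reference art datasets are systematically lower and noisier.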
“…Importantly, the anchor is separate from the image itself and can be non-visual. In previous research, the connection to external knowledge was often established through object detection or image classification (Mogadala et al., 2018; Zhou et al., 2019; Huang et al., 2020; Bai et al., 2021), leaving unexplored the potential benefits of utilizing the associated non-visual data. For example, certain elements of image metadata, such as the coordinates of its location or the date and time of its creation, can be used as an anchor, since they provide information about the circumstances in which the image originated and thus can help identify relevant entities and events.…”
Section: Identification of Relevant Knowledge (mentioning)
confidence: 99%
“…In CLIP-Art [7], the contrastive vision-language loss from CLIP [27] is used to fine-tune on the iMet collection [40], leading to improvements on downstream multimodal retrieval and classification tasks for paintings. The authors of [3] present a framework for generating informative painting captions based on masked sentence generation using an LSTM and knowledge retrieval using TF-IDF vectors. They report their experimental results on the SemArt collection [11].…”
Section: Related Work (mentioning)
confidence: 99%
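As a rough illustration of the TF-IDF-style knowledge retrieval mentioned in the statement above, the sketch below ranks external knowledge passages against a painting description by cosine similarity of TF-IDF vectors using scikit-learn. The passages and the query are invented placeholders, and this is a generic sketch of the technique, not the exact pipeline of [3].

    # Sketch: rank knowledge passages against a painting description with TF-IDF.
    # Corpus and query are invented placeholders for illustration only.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    passages = [
        "Impressionism emerged in France in the 1870s.",
        "Oil on canvas was the dominant medium of the period.",
        "The sitter was a patron of the artist.",
    ]
    query = "an impressionist oil painting from France"

    vectorizer = TfidfVectorizer(stop_words="english")
    passage_vecs = vectorizer.fit_transform(passages)   # fit vocabulary on corpus
    query_vec = vectorizer.transform([query])           # embed query in same space

    # Cosine similarity between the query and every passage; highest wins.
    scores = cosine_similarity(query_vec, passage_vecs).ravel()
    best = scores.argmax()
    print(passages[best], round(float(scores[best]), 3))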
“…the images contained have vastly greater style variation than existing datasets [42, 16], which predominantly focus on small subsets of fine art, sometimes further limited to only European or Asian art [3, 27, 53]. Not only is StyleBabel's domain more diverse, but our annotations also differ.…”
Section: Introduction (mentioning)
confidence: 99%