2021
DOI: 10.48550/arxiv.2106.14019
Preprint

UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning

Abstract: Despite the success of various text generation metrics such as BERTScore, it is still difficult to evaluate image captions without enough reference captions due to the diversity of the descriptions. In this paper, we introduce a new metric UMIC, an Unreferenced Metric for Image Captioning, which does not require reference captions to evaluate image captions. Based on Vision-and-Language BERT, we train UMIC to discriminate negative captions via contrastive learning. Also, we observe critical problems of the …

Cited by 4 publications (8 citation statements)
References 12 publications
“…In a similar vein, the Alignment score [152] analyzes whether ideas are mentioned in a human-like order by comparing the alignment of noun sequences in candidate and reference sentences. Moreover, the Coverage score [153], [154] calculates the extent of a caption by taking into account the scene's indicated visual elements. This score directly considers visual elements and can be used even without ground-truth captions.…”
Section: Embedding-based Metrics
confidence: 99%
“…Researchers have also proposed unreferenced image captioning metrics that evaluate generated captions by comparing them with original images. For instance, VIFIDEL (Madhyastha et al, 2019) uses the word mover distance (Kusner et al, 2015) between the image and candidate caption, and UMIC (Lee et al, 2021), which fine-tunes UNITER (Chen et al, 2020) using contrastive loss from augmented captions, directly evaluates captions generated from vision-and-language embedding spaces.…”
Section: Related Work
confidence: 99%
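To make the contrastive objective described above concrete, here is a minimal PyTorch-style sketch of a margin-based contrastive loss that pushes an image's score with its original caption above its score with an augmented negative caption. The scorer `score_model`, its input format, and the margin value are hypothetical placeholders for illustration, not the authors' UNITER-based implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_caption_loss(score_model, image_feats, pos_captions, neg_captions, margin=0.2):
    """Hinge-style contrastive loss for an unreferenced caption metric.

    `score_model` is a hypothetical cross-modal scorer (e.g. a vision-and-language
    transformer head) that returns one scalar score per (image, caption) pair.
    The (image, original caption) pair should score at least `margin` higher
    than the (image, negative caption) pair.
    """
    pos_scores = score_model(image_feats, pos_captions)  # shape: (batch,)
    neg_scores = score_model(image_feats, neg_captions)  # shape: (batch,)
    # Penalize pairs where the negative caption is not separated by the margin.
    loss = F.relu(margin - (pos_scores - neg_scores)).mean()
    return loss
```

At inference time, the scalar output of such a scorer can be used directly as the caption quality score, with no reference captions involved.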
“…Table 3 shows that PR-MCS is a useful image captioning metric with high correlation with human judgment. Flickr8k_Expert (Hodosh et al, 2013) and CapEval1k (Lee et al, 2021) are evaluation sets for measuring the performance of image captioning metrics; higher values of the Kendall tau-c (τc) (Kendall, 1938) and the Pearson correlation coefficient (ρ) (Benesty et al, 2009), which indicate the correlation with human judgment, are better. The Kendall tau-c value measures the similarity between two variables based on ranking, and the Pearson correlation coefficient is a measure of linear correlation between two sets of data.…”
Section: Correlations With Human Judgement
confidence: 99%
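As a concrete illustration of the evaluation procedure mentioned above, the following sketch computes the two correlation statistics with SciPy; the metric scores and human ratings are hypothetical values, not data from Flickr8k_Expert or CapEval1k.

```python
from scipy.stats import kendalltau, pearsonr

# Hypothetical example: metric scores and human ratings for five captions.
metric_scores = [0.71, 0.42, 0.88, 0.35, 0.60]
human_ratings = [4, 2, 5, 1, 3]

# Kendall tau-c: rank-based agreement between the metric and human judgments.
tau_c, tau_p = kendalltau(metric_scores, human_ratings, variant="c")
# Pearson rho: linear correlation between the two sets of scores.
rho, rho_p = pearsonr(metric_scores, human_ratings)

print(f"Kendall tau-c = {tau_c:.3f} (p = {tau_p:.3f})")
print(f"Pearson rho   = {rho:.3f} (p = {rho_p:.3f})")
```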
“…Sariyildiz et al [65] introduced transformer-based image-conditioned masked language modelling (ICMLM) for learning visual representations from image-caption pairs. Lee et al [66] proposed a new metric UMIC, an unreferenced metric for image captioning which does not require reference captions to evaluate image captions, and adopted a pre-trained vision-and-language transformer to score candidate captions. Yang et al [67] proposed a novel transformer, ReFormer, adapted to generate features embedded with relational information and to explicitly express the paired relations between objects in images.…”
Section: Enhanced Image Captioning With Attention Correction
confidence: 99%
“…Since several commonly used evaluation metrics originate from machine translation, optimization tends to be driven by cross-entropy loss. Therefore, researchers are also putting more effort into designing stronger reward functions [66, 90, 112-114].…”
Section: Recent Trends
confidence: 99%