2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2018.00608

Learning to Evaluate Image Captioning

Abstract: Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well-known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified. For example, the newly proposed SPICE correlates well with human judgments, but fails to capture the syntactic structure of a sentence. To address these two challenges…

Cited by 116 publications (108 citation statements)
References 25 publications
“…We compare our metrics with mBLEU_mix = 1 − (1/4) Σ_{n=1}^{4} mBLEU_n, which accounts for mBLEU-{1,2,3,4}, and we invert the score so that it is consistent with our diversity metrics (higher values indicate more diversity). In our instructions, diversity refers to different words, phrases, sentence structures, semantics or other factors that impact diversity.…”
Section: Considering Diversity and Accuracy
confidence: 99%
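As a rough illustration of the inverted mBLEU diversity score quoted above, the sketch below computes mBLEU_n over a set of generated captions (each caption scored against the remaining ones) and combines the four n-gram orders. The helper names and the NLTK-based BLEU computation are assumptions made for illustration, not the cited paper's implementation.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def mbleu_n(captions, n):
    """Average BLEU-n of each caption scored against the other captions in the set."""
    smooth = SmoothingFunction().method1
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights for n-gram orders 1..n
    scores = []
    for i, hyp in enumerate(captions):
        refs = [c.split() for j, c in enumerate(captions) if j != i]
        scores.append(sentence_bleu(refs, hyp.split(), weights=weights,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)

def mbleu_mix(captions):
    """Inverted mixture: 1 - (1/4) * sum of mBLEU_1..4, so higher means more diverse."""
    return 1.0 - 0.25 * sum(mbleu_n(captions, n) for n in range(1, 5))

# Example: near-duplicate captions yield a low diversity score.
caps = ["a dog runs in the park", "a dog runs in a park", "a brown dog plays outside"]
print(mbleu_mix(caps))
```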
“…In contrast, human annotations tend to have varying captions, since the background knowledge of each person varies, leading to lower BMRC metrics. SPICE [1] has been shown to be better correlated with human judgement [6]. Unfortunately, the SPICE metric is currently not available from the online test server.…”
Section: Results On Online Test Set
confidence: 99%
“…[39,31,57]. [8] train a general critic network to learn to score captions, providing various types of corrupted captions as negatives. [51] use a composite metric, a classifier trained on the automatic scores as input.…”
Section: Related Work
confidence: 99%
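The "composite metric" mentioned in the last quoted sentence (a classifier that takes existing automatic scores as input) can be sketched roughly as follows. The logistic-regression choice, the feature set, and the function names are illustrative assumptions, not the exact setup of [51] or of the critic network in [8].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_composite_metric(score_matrix, labels):
    """Fit a classifier on rows of automatic scores (e.g., BLEU, METEOR, CIDEr).

    labels: 1 for human/reference-quality captions, 0 for corrupted or
    machine-generated negatives (the negative-sampling idea described above).
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.asarray(score_matrix), np.asarray(labels))
    return clf

def composite_score(clf, metric_scores):
    """Return the probability that a caption with these metric scores is human-like."""
    return clf.predict_proba(np.asarray(metric_scores).reshape(1, -1))[0, 1]
```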