Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.253

Quality Estimation for Image Captions Based on Large-scale Human Evaluations

Abstract: Automatic image captioning has improved significantly over the last few years, but the problem is far from being solved, with state-of-the-art models still often producing low-quality captions when used in the wild. In this paper, we focus on the task of Quality Estimation (QE) for image captions, which attempts to model the caption quality from a human perspective and without access to ground-truth references, so that it can be applied at prediction time to detect low-quality captions produced on previously un…
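As the abstract describes, QE scores a caption without reference captions so that low-quality outputs can be suppressed at prediction time. Below is a minimal sketch of that gating step; the `qe_score` callable and the length-based stand-in scorer are hypothetical placeholders for a trained QE model, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CaptionCandidate:
    image_id: str
    caption: str


def filter_captions(
    candidates: list[CaptionCandidate],
    qe_score: Callable[[CaptionCandidate], float],
    threshold: float = 0.5,
) -> list[CaptionCandidate]:
    """Keep only captions whose estimated (reference-free) quality clears the threshold."""
    return [c for c in candidates if qe_score(c) >= threshold]


if __name__ == "__main__":
    # Stand-in scorer: a real QE model would consume image features and caption
    # text; here the score is faked from caption length purely for illustration.
    def dummy_qe_score(c: CaptionCandidate) -> float:
        return min(len(c.caption.split()) / 10.0, 1.0)

    candidates = [
        CaptionCandidate("img_001", "a dog"),
        CaptionCandidate("img_001", "a brown dog running across a grassy field"),
    ]
    for kept in filter_captions(candidates, dummy_qe_score, threshold=0.6):
        print(kept.caption)
```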

Cited by 12 publications (8 citation statements) | References 23 publications (22 reference statements)
“…Significant prior work has explored the collection of human judgments of the quality of visual descriptions. Human judgment is considered the gold standard for visual description evaluation, and previous studies typically rely on human annotators to rate caption quality on one or multiple axes (Levinboim et al., 2021; Kasai et al., 2022). While automated methods exist for the evaluation of caption quality (Agarwal and Lavie, 2008; Vedantam et al., 2015; Papineni et al., 2002), recent work including THUMB (Kasai et al., 2022), which has run human evaluations on captions produced by models based on "Precision", "Recall", "Fluency", "Conciseness" and "Inclusive Language", has shown that humans produce captions which score significantly higher when judged by human raters than when judged by existing measures (and further, that human judgments of quality correlate poorly with existing measures), necessitating human evaluation as opposed to evaluation of captioning methods using automated measures for caption quality.…”
Section: F Human Studies (mentioning)
confidence: 99%
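The excerpt above hinges on how well automatic caption metrics track human ratings. The snippet below illustrates that correlation check with made-up numbers (not data from the cited studies); in practice the metric scores would come from per-caption values of a measure such as the ones cited above.

```python
# Illustrative correlation check between human ratings and an automatic metric.
# All numbers below are fabricated for demonstration only.
from scipy.stats import spearmanr

human_ratings = [4.5, 2.0, 3.5, 1.0, 5.0, 2.5]        # e.g. mean rater score per caption
metric_scores = [0.62, 0.55, 0.40, 0.48, 0.70, 0.30]  # e.g. an automatic metric per caption

rho, p_value = spearmanr(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A low |rho| would support the claim that automatic measures correlate
# poorly with human judgments of caption quality.
```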
“…Many large-scale datasets have been created [6, 10-20], but in contrast to the previous ones, they employ automated pipelines. One example of a dataset that follows this approach is the Conceptual Captions dataset [5] which has more than 3.3M pairs of images and English captions.…”
Section: Related Work (mentioning)
confidence: 99%
“…[1-10], [11-100], [101-1000], [1001-10,000], [10,001-20,000], and 20,001 or more.…”
unclassified
“…QE is widely established in machine translation (MT) tasks (Specia et al., 2013; Martins et al., 2017; Specia et al., 2018). Recently, Levinboim et al. (2021) introduced large-scale human ratings on image-caption pairs for training QE models in image captioning tasks. Our work also trains a caption QE model, (i.e.…
Section: Related Work (mentioning)
confidence: 99%
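The excerpt above describes training QE models on large-scale human ratings of image-caption pairs. The following is a minimal sketch of that setup under stated assumptions: the joint image-caption features and ratings are synthetic placeholders, and ridge regression stands in for whatever model the cited works actually use.

```python
# Minimal sketch: fit a regression-style QE model on human ratings of
# image-caption pairs. Features and ratings are synthetic placeholders;
# a real system would use joint image-caption representations.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical dataset: one feature vector per image-caption pair,
# paired with a human quality rating scaled to [0, 1].
X = rng.normal(size=(500, 32))
y = np.clip(X[:, :4].mean(axis=1) + 0.1 * rng.normal(size=500), -1.0, 1.0) * 0.5 + 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
qe_model = Ridge(alpha=1.0).fit(X_train, y_train)
print("held-out R^2:", round(qe_model.score(X_test, y_test), 3))
```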