2022
DOI: 10.1109/tpami.2020.3013834
On Diversity in Image Captioning: Metrics and Methods

Cited by 30 publications (12 citation statements) · References 52 publications
“…However, using multiple models and sampling multiple descriptions may lead to redundancy. Devising image captioning models that produce diverse and distinct fine-grained image descriptions may improve performance on the CLEVER task; an active area of research [59,61] is looking into this problem.…”
Section: Discussion (mentioning)
Confidence: 99%
“…The long tail in the empirical distribution of the image-caption dataset MSCOCO is introduced in [3]. As in linguistic studies, lower-frequency words often carry higher information entropy [20], and [52] argues that a long tail in the word-frequency distribution indicates higher diversity in the generated captions. There are two recent works related to the long-tail phenomenon in seq2seq models [53], [54].…”
Section: Related Work (mentioning)
Confidence: 99%
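
To make the entropy argument in the statement above concrete: the diversity of a caption set is often proxied by the Shannon entropy of its word-frequency distribution (a longer tail of rare words raises the entropy) or by Distinct-n, the ratio of unique to total n-grams. A minimal Python sketch; the function names and toy captions are ours, not from the cited works:

```python
from collections import Counter
import math

def word_entropy(captions):
    # Shannon entropy (bits) of the word-frequency distribution over
    # a set of captions; a longer tail of rare words raises it.
    counts = Counter(w for c in captions for w in c.lower().split())
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def distinct_n(captions, n=2):
    # Distinct-n: unique n-grams divided by total n-grams.
    grams = [tuple(ws[i:i + n])
             for c in captions
             for ws in [c.lower().split()]
             for i in range(len(ws) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)

captions = ["a man rides a horse on a beach",
            "a person riding a horse along the shore",
            "a man on a horse near the ocean"]
print(word_entropy(captions), distinct_n(captions, 2))
```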
“…Following [52], we perform a user study to compare the correlations of the CIDErBtw metric (CB) and VSE++ Recall (VR) with human judgment. To show that our metric generalizes across varying degrees of model similarity, we run experiments on captions from three model pairs: 1) two captions from different models (DiscCap and StackCap); 2) two captions from similar models trained under two different conditions (Transformer+SCST models trained with and without DCR); 3) two captions from the same model (the 1st and 2nd results from beam search on Transformer+SCST+DCR).…”
Section: CIDErBtw as a Metric (mentioning)
Confidence: 99%
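
For readers unfamiliar with CIDErBtw: as we understand it from the quoted work, it scores a caption by its average CIDEr similarity to the reference captions of retrieved similar images, so a lower score suggests a more distinctive caption. A minimal sketch assuming a generic cider(candidate, references) scorer; the signature and names are illustrative stand-ins, not the cited paper's API:

```python
from typing import Callable, List

def cider_btw(caption: str,
              similar_image_refs: List[List[str]],
              cider: Callable[[str, List[str]], float]) -> float:
    # Average CIDEr between `caption` and the reference captions of
    # each similar image (one reference list per retrieved image).
    # `cider` is a hypothetical stand-in for any off-the-shelf CIDEr
    # implementation; lower output = more distinctive caption.
    scores = [cider(caption, refs) for refs in similar_image_refs]
    return sum(scores) / len(scores)
```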
“…Another drawback of the standard metrics is that they do not capture (but rather disfavor) the desirable capability of a system to produce novel and diverse captions, which is more in line with the variability with which humans describe complex images. This consideration led to the development of diversity metrics [132], [139], [140], [141]. Most of these metrics can potentially be calculated even when no ground-truth captions are available at test time.…”
Section: Diversity Metrics (mentioning)
Confidence: 99%
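
As a generic illustration of a reference-free diversity metric (not one of the specific metrics of [132], [139], [140], [141]): one simple formulation scores a set of captions sampled for a single image by one minus their mean pairwise n-gram overlap, and so needs no ground-truth captions at test time:

```python
def pairwise_diversity(captions, n=1):
    # 1 minus the mean pairwise Jaccard overlap of n-gram sets among
    # captions sampled for the same image; higher = more varied.
    # Assumes at least two non-empty captions.
    def ngrams(c):
        ws = c.lower().split()
        return set(tuple(ws[i:i + n]) for i in range(len(ws) - n + 1))
    sets = [ngrams(c) for c in captions]
    sims = [len(a & b) / len(a | b)
            for i, a in enumerate(sets) for b in sets[i + 1:]]
    return 1.0 - sum(sims) / len(sims)

print(pairwise_diversity(["a dog runs in the park",
                          "a puppy plays on the grass"]))
```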