2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015
DOI: 10.1109/cvpr.2015.7299087

CIDEr: Consensus-based image description evaluation

Abstract: Automatically describing an image with a sentence is a long-standing challenge in computer vision and natural language processing. Due to recent progress in object detection, attribute classification, action recognition, etc., there is renewed interest in this area. However, evaluating the quality of descriptions has proven to be challenging. We propose a novel paradigm for evaluating image descriptions that uses human consensus. This paradigm consists of three main parts: a new triplet-based method of collect…
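For context, a minimal sketch of the consensus metric the paper introduces, following its standard formulation: each sentence is mapped to TF-IDF-weighted n-gram vectors (n = 1 to 4), and a candidate caption is scored by its average cosine similarity to the reference captions for the same image. This is an illustrative sketch only, not the official implementation; it omits the length penalty and count clipping of the CIDEr-D variant, and all function and variable names are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All word n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(counts, doc_freq, num_images):
    """TF-IDF weight each n-gram: normalized count times log inverse document frequency."""
    total = sum(counts.values())
    return {g: (c / total) * math.log(num_images / max(1.0, doc_freq[g]))
            for g, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider(candidate, references, corpus_refs, max_n=4):
    """candidate: token list; references: token lists for the same image;
    corpus_refs: per-image lists of reference token lists (used for document frequencies)."""
    num_images = len(corpus_refs)
    score = 0.0
    for n in range(1, max_n + 1):
        # Document frequency: number of images whose references contain the n-gram.
        doc_freq = Counter()
        for refs in corpus_refs:
            doc_freq.update({g for r in refs for g in ngrams(r, n)})
        cand_vec = tfidf(Counter(ngrams(candidate, n)), doc_freq, num_images)
        sims = [cosine(cand_vec, tfidf(Counter(ngrams(r, n)), doc_freq, num_images))
                for r in references]
        score += sum(sims) / len(sims) / max_n  # uniform weight over n = 1..4
    return score
```

The TF-IDF weighting is what encodes consensus: n-grams that appear in many images' references are down-weighted, so a candidate is rewarded for matching the phrasing that is distinctive to its own image.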

Cited by 3,224 publications (1,885 citation statements)
References 32 publications
“…We calculate BLEU (Papineni et al 2002), CIDEr (Vedantam et al 2015a), and METEOR (Denkowski and Lavie 2014) scores between the generated descriptions and their ground-truth descriptions. In all cases, the model trained on VisualGenome performs better.…”
Section: Generating Region Descriptions (mentioning)
confidence: 99%
“…BLEU is the precision of word n-grams between generated and reference sentences. Additionally, scores like METEOR (Vedantam et al, 2015) which capture perplexity of models for a given transcription have gained widespread attention. Perplexity is the geometric mean of the inverse probability for each predicted word.…”
Section: Results (mentioning)
confidence: 99%
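The two quantities the quoted passage defines can be stated compactly. The sketch below is illustrative only (function names are assumptions, not from any cited implementation): clipped n-gram precision, which BLEU combines over several orders with a brevity penalty, and perplexity as the geometric mean of the inverse per-word probabilities.

```python
import math
from collections import Counter

def ngram_precision(candidate, references, n):
    """Clipped n-gram precision (one order of BLEU, without the brevity penalty)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    clip = Counter()
    for ref in references:
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        for g, c in ref_counts.items():
            clip[g] = max(clip[g], c)  # highest count of this n-gram in any reference
    matched = sum(min(c, clip[g]) for g, c in cand.items())
    return matched / max(1, sum(cand.values()))

def perplexity(word_probs):
    """Geometric mean of inverse per-word probabilities: exp(-mean log p)."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))
```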
“…5.3.2. As known from literature (Elliott and Keller 2013; Vedantam et al 2015), automatic evaluation measures do not always agree with the human evaluation. Here we see that human judges prefer the descriptions from Frame-Video-Concept Fusion approach in terms of correctness, grammar and relevance.…”
Section: LSMDC 15 (mentioning)
confidence: 97%
“…6), we focus our discussion on METEOR and CIDEr scores in the preliminary evaluations in this section. According to Elliott and Keller (2013) and Vedantam et al (2015), METEOR/CIDEr supersede previously used measures in terms of agreement with human judgments.…”
Section: Automatic Metrics (mentioning)
confidence: 99%