2021
DOI: 10.48550/arxiv.2106.14019
Preprint

UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning

Abstract: Despite the success of various text generation metrics such as BERTScore, it is still difficult to evaluate image captions without enough reference captions due to the diversity of the descriptions. In this paper, we introduce a new metric UMIC, an Unreferenced Metric for Image Captioning, which does not require reference captions to evaluate image captions. Based on Vision-and-Language BERT, we train UMIC to discriminate negative captions via contrastive learning. Also, we observe critical problems of the …

Cited by 4 publications (8 citation statements)
References 12 publications
“…In a similar vein, the Alignment score [152] analyzes whether ideas are mentioned in a human-like order by comparing the alignment of noun sequences in candidate and reference sentences. Moreover, the Coverage score [153], [154] calculates the extent of a caption by taking into account the scene's indicated visual elements. This score directly considers visual elements and can be used even without ground-truth captions.…”
Section: Embedding-based Metrics
confidence: 99%
“…Researchers have also proposed unreferenced image captioning metrics that evaluate generated captions by comparing them with original images. For instance, VIFIDEL (Madhyastha et al, 2019) uses the word mover distance (Kusner et al, 2015) between the image and candidate caption, and UMIC (Lee et al, 2021), which fine-tunes UNITER (Chen et al, 2020) using contrastive loss from augmented captions, directly evaluates captions generated from vision-and-language embedding spaces.…”
Section: Related Work
confidence: 99%
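To make the contrastive objective described above concrete, here is a minimal PyTorch-style sketch of a margin-based contrastive loss that pushes an image's score with its original caption above its score with an augmented negative caption. The scorer `score_model`, its input format, and the margin value are hypothetical placeholders for illustration, not the authors' UNITER-based implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_caption_loss(score_model, image_feats, pos_captions, neg_captions, margin=0.2):
    """Hinge-style contrastive loss for an unreferenced caption metric.

    `score_model` is a hypothetical cross-modal scorer (e.g. a vision-and-language
    transformer head) that returns one scalar score per (image, caption) pair.
    The (image, original caption) pair should score at least `margin` higher
    than the (image, negative caption) pair.
    """
    pos_scores = score_model(image_feats, pos_captions)  # shape: (batch,)
    neg_scores = score_model(image_feats, neg_captions)  # shape: (batch,)
    # Penalize pairs where the negative caption is not separated by the margin.
    loss = F.relu(margin - (pos_scores - neg_scores)).mean()
    return loss
```

At inference time, the scalar output of such a scorer can be used directly as the caption quality score, with no reference captions involved.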
“…Table 3 shows that PR-MCS is a useful image captioning metric with high correlation with human judgment. Flickr8k_Expert (Hodosh et al, 2013) and CapEval1k (Lee et al, 2021) are evaluation sets for measuring the performance of image captioning metrics; higher values of the Kendall tau-c (τc) (Kendall, 1938) and the Pearson correlation coefficient (ρ) (Benesty et al, 2009), which indicate the correlation with human judgment, are better. The Kendall tau-c value measures the similarity between two variables based on ranking, and the Pearson correlation coefficient is a measure of linear correlation between two sets of data.…”
Section: Correlations With Human Judgement
confidence: 99%
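As a concrete illustration of the evaluation procedure mentioned above, the following sketch computes the two correlation statistics with SciPy; the metric scores and human ratings are hypothetical values, not data from Flickr8k_Expert or CapEval1k.

```python
from scipy.stats import kendalltau, pearsonr

# Hypothetical example: metric scores and human ratings for five captions.
metric_scores = [0.71, 0.42, 0.88, 0.35, 0.60]
human_ratings = [4, 2, 5, 1, 3]

# Kendall tau-c: rank-based agreement between the metric and human judgments.
tau_c, tau_p = kendalltau(metric_scores, human_ratings, variant="c")
# Pearson rho: linear correlation between the two sets of scores.
rho, rho_p = pearsonr(metric_scores, human_ratings)

print(f"Kendall tau-c = {tau_c:.3f} (p = {tau_p:.3f})")
print(f"Pearson rho   = {rho:.3f} (p = {rho_p:.3f})")
```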
“…Sariyildiz et al [65] introduced transformer-based image-conditioned masked language modelling (ICMLM) for learning visual representations from image-caption pairs. Lee et al [66] proposed a new metric UMIC, an unreferenced metric for image captioning which does not require reference captions to evaluate image captions, and adopted a pre-trained vision-and-language transformer to score candidate captions. Yang et al [67] proposed a novel transformer, ReFormer, adapted to generate features embedded with relational information and to explicitly express the paired relations between objects in images.…”
Section: Enhanced Image Captioning With Attention Correction
confidence: 99%
“…Since several commonly used evaluation metrics originate from machine translation, optimization tends to be driven by cross-entropy loss. Therefore, researchers are also putting more effort into designing stronger reward functions [66, 90, 112-114].…”
Section: Recent Trends
confidence: 99%