Quality Estimation for Image Captions Based on Large-scale Human Evaluations

Levinboim, Tomer; Thapliyal, Ashish; Sharma, Piyush; Soricut, Radu

doi:10.48550/arxiv.1909.03396

Cited by 2 publications

(3 citation statements)

References 20 publications

“…ActivityNet Captions [16], MS COCO [26], MSR-VTT [45], Flickr30k Denotations [47], SBU [31], A2D [44], Visual Genome [17], Conceptual Captions [34], Charades [36], Charades-Ego [35], OID [21], TGIF [24], ActivityNet-Entities [49]…”

Section: A Appendix: Dataset Construction A1 Datasets Used For Long S... (mentioning)

confidence: 99%

NoisyActions2M: A Multimedia Dataset for Video Understanding from Noisy Labels

Sharma, Patra, Desai et al. 2021

Preprint

Deep learning has shown remarkable progress in a wide range of problems. However, efficient training of such models requires large-scale datasets, and getting annotations for such datasets can be challenging and costly. In this work, we explore the use of freely available user-generated labels from web videos for video understanding. We create a benchmark dataset consisting of around 2 million videos with associated user-generated annotations and other meta information. We utilize the collected dataset for action classification and demonstrate its usefulness with existing small-scale annotated datasets, UCF101 and HMDB51. We study different loss functions and two pretraining strategies, simple and self-supervised learning. We also show how a network pretrained on the proposed dataset can help against video corruption and label noise in downstream datasets. We present this as a benchmark dataset in noisy learning for video understanding. The dataset, code, and trained models will be publicly available for future research.

“…Finally, we perform an experiment to understand the extent to which the quality of the Stabilizer outputs is correlated with the quality of the target-language Captions, so that a QE model (Levinboim et al, 2019) on the Stabilizer outputs). To that end, we perform human evaluations of stand-alone captions.…”

Section: Stabilizers Used For Quality Estimation (mentioning)

confidence: 99%

“…There is a final additional advantage to having PLuGS models as a solution: in real-world applications of image captioning, quality estimation of the resulting captions is an important component that has recently received attention (Levinboim et al, 2019). Again, labeled data for quality-estimation (QE) is only available for English, and generating it separately for other languages of interest is expensive, time-consuming, and scales poorly.…”

Section: Introduction (mentioning)

confidence: 99%

Cross-modal Language Generation using Pivot Stabilization for Web-scale Language Coverage

Thapliyal, Soricut 2020

Preprint

Self Cite

Cross-modal language generation tasks such as image captioning are directly hurt in their ability to support non-English languages by the trend of data-hungry models combined with the lack of non-English annotations. We investigate potential solutions for combining existing language-generation annotations in English with translation capabilities in order to create solutions at web-scale in both domain and language coverage. We describe an approach called Pivot-Language Generation Stabilization (PLuGS), which leverages directly at training time both existing English annotations (gold data) as well as their machine-translated versions (silver data); at run-time, it generates first an English caption and then a corresponding target-language caption. We show that PLuGS models outperform other candidate solutions in evaluations performed over 5 different target languages, under a large-domain test set using images from the Open Images dataset. Furthermore, we find an interesting effect where the English captions generated by the PLuGS models are better than the captions generated by the original, monolingual English model.
