2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00904

nocaps: novel object captioning at scale

Abstract: Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data.

Cited by 233 publications (198 citation statements)
References 40 publications

“…Furthermore, when evaluating out-of-domain images or images with unseen concepts, it has been shown that the generated captions are often of poor quality (Mao et al., 2015; Vinyals et al., 2017). Attempts have been made to address the latter issue by leveraging unpaired text data or pre-trained language models (Hendricks et al., 2016; Agrawal et al., 2018).…”
Section: Introduction
confidence: 99%

“…Multilingual Visual Understanding. Numerous tasks have been proposed that combine vision and language to enhance the understanding of either or both, such as video/image captioning [18,60,2], visual question answering (VQA) [4], and natural language moment retrieval [25]. Multilingual studies are rarely explored in the vision and language domain.…”
Section: Related Work
confidence: 99%

“…Many of the popular captioning datasets in the AI community were created using the same basic crowdsourcing task design. This task design, first developed in 2013 [49,106], remains the standard approach [7,28]. One concern about crowdsourced datasets built using this standard task design is that captions for the same image generated by different people can vary considerably [53,98].…”
Section: The Critical Foundation Of Image Captioning Algorithms: Larg…
confidence: 99%