2019
DOI: 10.1109/tmm.2019.2896494
COCO-CN for Cross-Lingual Image Tagging, Captioning, and Retrieval

Abstract: This paper contributes to cross-lingual image annotation and retrieval in terms of data and baseline methods. We propose COCO-CN, a novel dataset enriching MS-COCO with manually written Chinese sentences and tags. For more effective annotation acquisition, we develop a recommendation-assisted collective annotation system, automatically providing an annotator with several tags and sentences deemed to be relevant with respect to the pictorial content. Having 20,342 images annotated with 27,218 Chinese sentences a…

Cited by 97 publications (74 citation statements). References 40 publications (83 reference statements).
“…Pappas et al. [42] propose multilingual visual concept clustering to study the commonalities and differences among different languages. Meanwhile, multilingual image captioning has been introduced to describe the content of an image in multiple languages [32,57,33]. However, none of these works studies the interaction between videos and multilingual knowledge.…”
Section: Related Work
Confidence: 99%
“…Yoshikawa et al. [27] further enlarged the collection of Japanese captions for MS COCO and released the STAIR Captions dataset. There are also extensions for Chinese, such as [28], [29]. Li et al. [28] presented a comparison of Chinese caption datasets constructed by crowdsourcing and by machine translation.…”
Section: B. Cross-Lingual Vision and Language
Confidence: 99%
“…Li et al. [28] presented a comparison of Chinese caption datasets constructed by crowdsourcing and by machine translation. Li et al. [29] added Chinese captions and tags to MS COCO. For video captions, Chen and Dolan [30] collected short video clips and captions in many different languages.…”
Section: B. Cross-Lingual Vision and Language
Confidence: 99%
“…Besides, image captioning is mostly done in English, as most of the available datasets are in this language [31,34]. Only a few studies have been conducted on cross-lingual image captioning [17], [38], [39]. In this paper, the model is designed to perform cross-lingual image captioning.…”
Section: Related Work
Confidence: 99%