2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00751

Towards Unsupervised Image Captioning With Shared Multimodal Embeddings

Abstract: Understanding images without explicit supervision has become an important problem in computer vision. In this paper, we address image captioning by generating language descriptions of scenes without learning from annotated pairs of images and their captions. The core component of our approach is a shared latent space that is structured by visual concepts. In this space, the two modalities should be indistinguishable. A language model is first trained to encode sentences into semantically structured embeddings.…
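To make the abstract's idea of a shared, modality-indistinguishable latent space more concrete, here is a minimal sketch in PyTorch. The encoder sizes, the GRU sentence encoder, and the adversarial "modality confusion" objective are assumptions chosen for illustration, not the authors' exact architecture or training procedure.

```python
# Illustrative sketch of a shared multimodal embedding space (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 256  # dimensionality of the shared latent space (assumed)

class SentenceEncoder(nn.Module):
    """Encodes a tokenized sentence into the shared space."""
    def __init__(self, vocab_size, emb_dim=EMB_DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, emb_dim, batch_first=True)

    def forward(self, tokens):               # tokens: (batch, seq_len) int64
        _, h = self.rnn(self.embed(tokens))  # h: (1, batch, emb_dim)
        return F.normalize(h.squeeze(0), dim=-1)

class ImageEncoder(nn.Module):
    """Projects precomputed CNN image features into the same space."""
    def __init__(self, feat_dim=2048, emb_dim=EMB_DIM):
        super().__init__()
        self.proj = nn.Linear(feat_dim, emb_dim)

    def forward(self, feats):                 # feats: (batch, feat_dim)
        return F.normalize(self.proj(feats), dim=-1)

# A small critic that tries to tell image embeddings from sentence embeddings;
# training the encoders against it pushes the two modalities to look alike
# in the shared space (one possible way to realize "indistinguishability").
discriminator = nn.Sequential(
    nn.Linear(EMB_DIM, 128), nn.ReLU(), nn.Linear(128, 1)
)

def modality_confusion_loss(img_emb, txt_emb):
    """Encoder-side loss: both modalities should be uninformative to the critic."""
    logits_img = discriminator(img_emb)
    logits_txt = discriminator(txt_emb)
    target = torch.full_like(logits_img, 0.5)  # critic left maximally uncertain
    return F.binary_cross_entropy_with_logits(logits_img, target) + \
           F.binary_cross_entropy_with_logits(logits_txt, target)

if __name__ == "__main__":
    txt_enc = SentenceEncoder(vocab_size=10000)
    img_enc = ImageEncoder()
    tokens = torch.randint(0, 10000, (4, 12))   # dummy tokenized captions
    feats = torch.randn(4, 2048)                # dummy CNN image features
    loss = modality_confusion_loss(img_enc(feats), txt_enc(tokens))
    print(loss.item())
```

In a full adversarial setup the critic would also be updated with true modality labels; the sketch only shows the encoder-side objective that structures the shared space.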

Cited by 87 publications (91 citation statements)
References 58 publications (78 reference statements)
“…Similarly, Peri et al. [53] proposed a multimodal framework that encodes both images and captions, using a CNN and an RNN, into an intermediate-level representation and then decodes these multimodal representations into a new caption similar to the input. The authors of [128] presented an unsupervised image captioning framework based on a new alignment method that allows the simultaneous integration of the visual and textual streams through semantic learning of multimodal embeddings of the language and vision domains. Moreover, a multimodal model can also aggregate motion information [174], acoustic information [175], temporal information [176], etc.…”
Section: Image Captioning (mentioning; confidence: 99%)
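The encode-then-decode pattern described in the statement above (CNN image features conditioning an RNN that generates a caption) can be sketched as follows. The layer sizes, the feature-initialized GRU, and the training setup are assumptions for illustration, not the cited papers' exact models.

```python
# Generic CNN-encode / RNN-decode captioning sketch (illustrative assumptions).
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, hid_dim=256):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hid_dim)   # image feature -> initial state
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.rnn = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, img_feats, tokens):
        # img_feats: (batch, feat_dim) from a pretrained CNN; tokens: (batch, seq)
        h0 = torch.tanh(self.init_h(img_feats)).unsqueeze(0)
        outputs, _ = self.rnn(self.embed(tokens), h0)
        return self.out(outputs)                      # (batch, seq, vocab) logits

# Usage: condition on image features, feed the previous tokens, and train with
# cross-entropy against the next-token targets.
decoder = CaptionDecoder(vocab_size=10000)
logits = decoder(torch.randn(2, 2048), torch.randint(0, 10000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 10000])
```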
“…Unsupervised Grounding Learning Prior works have explored learning grounding with weak or no supervision (Rohrbach et al., 2016; Xiao et al., 2017). Closest to this paper is unsupervised image captioning (Feng et al., 2019; Laina et al., 2019; Gu et al., 2019), which conducts image captioning with unpaired images and captions. Similar to this work, the detector tags serve as the anchor points for image captioning.…”
Section: Unsupervised Multi-lingual Language Model (mentioning; confidence: 99%)
“…Collecting such labeled data takes long hours of labor and is expensive, so there is strong interest in developing algorithms that do not need many annotated examples. Some studies embed the visual features and text information into a shared space and design unsupervised learning algorithms to reduce the requirement for annotated data (Gu et al., 2018; Gu et al., 2019; Laina et al., 2019; Feng et al., 2019). However, the performance of such algorithms is poor because they do not use pairs of labeled examples at all.…”
Section: Introduction (mentioning; confidence: 99%)