2020
DOI: 10.1609/aaai.v34i07.7005

Unified Vision-Language Pre-Training for Image Captioning and VQA

Abstract: This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text …
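
The abstract's central design point is a single transformer stack that serves as both encoder and decoder, with the fine-tuning regime (understanding vs. generation) controlled by the self-attention mask. The sketch below only illustrates that idea in PyTorch and is not the authors' released code: the class name, layer counts, dimensions, and the exact masking scheme are assumptions.

```python
# Minimal sketch (assumptions, not the authors' released code) of the unified
# idea: one shared transformer stack serves understanding tasks with full
# bidirectional attention and generation tasks with a seq2seq (causal) mask.
# Positional/segment embeddings and pre-training objectives are omitted.
import torch
import torch.nn as nn

class UnifiedVLTransformer(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_layers=6,
                 vocab_size=30522, region_dim=2048):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, d_model)  # detector region features -> model dim
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.shared_stack = nn.TransformerEncoder(layer, n_layers)  # used for BOTH settings
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, token_ids, generation=False):
        x_img = self.region_proj(region_feats)   # (B, R, d)
        x_txt = self.token_emb(token_ids)        # (B, T, d)
        x = torch.cat([x_img, x_txt], dim=1)     # joint image-text sequence
        R, T = x_img.size(1), x_txt.size(1)
        mask = torch.zeros(R + T, R + T, dtype=torch.bool)  # False = may attend
        if generation:
            # seq2seq masking: image regions attend only to regions; each text
            # token attends to all regions plus the text tokens to its left.
            mask[:R, R:] = True
            mask[R:, R:] = torch.ones(T, T).triu(1).bool()
        h = self.shared_stack(x, mask=mask)
        return self.lm_head(h[:, R:])            # logits at the text positions

model = UnifiedVLTransformer()
regions = torch.randn(2, 36, 2048)              # e.g., 36 detected regions per image
tokens = torch.randint(0, 30522, (2, 12))       # a short text sequence
vqa_logits = model(regions, tokens, generation=False)  # understanding-style masking
cap_logits = model(regions, tokens, generation=True)   # generation-style masking
```

The same shared stack is used in both calls; only the attention mask changes, which is the sense in which the model is "unified".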

Cited by 587 publications (410 citation statements)
References 28 publications (48 reference statements)

“…While multi-head attention has been widely exploited in many vision-language (VL) tasks, such as image captioning (Zhou et al, 2020), visual question answering (Tan and Bansal, 2019), and visual dialog (Kang et al, 2019; Wang et al, 2020), its potential benefit for modeling flexible cross-media posts has been previously ignored. Due to the informal style of social media, cross-media keyphrase prediction brings unique difficulties in two main aspects: first, its text-image relationship is rather complicated (Vempala and Preotiuc-Pietro, 2019), whereas in conventional VL tasks the two modalities share most of their semantics; second, social media images usually exhibit a more diverse distribution and a much higher probability of containing OCR tokens (§4), thereby posing a hurdle for effective processing.…”
Section: Related Work
confidence: 99%
“…• L. Zhou et al [19], Vision-Language Pre-training (VLP): State-of-the-art attention-based model for vision-language generation tasks such as image captioning and visual question answering. • S. Zhao et al [8], Feature-based IER: Image Emotion Recognition approach using low- and mid-level visual features.…”
Section: State-of-the-art (SOTA) Methods for Performance Comparison
confidence: 99%
“…We briefly describe our base captioning model, which consists of a Faster R-CNN and a Transformer-based encoder-decoder, following the sequence-to-sequence framework common in state-of-the-art image captioning systems (Anderson et al, 2018; Vinyals et al, 2015; Zhou et al, 2019). See Appendix A for full technical details.…”
Section: Model Architecture
confidence: 99%
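
The excerpt above describes the now-standard pipeline of detector region features feeding a Transformer encoder-decoder trained sequence-to-sequence. The snippet below is a minimal, hedged sketch of that setup, not the cited system's implementation: it assumes region features are precomputed by a Faster R-CNN detector, and the class name, layer counts, and dimensions are illustrative.

```python
# Minimal sketch (illustrative assumptions, not the cited authors' code) of a
# region-features -> Transformer encoder-decoder captioner. Region features are
# assumed to be precomputed by a Faster R-CNN detector; positional embeddings
# and the decoding loop are omitted for brevity.
import torch
import torch.nn as nn

class RegionCaptioner(nn.Module):
    def __init__(self, d_model=512, n_heads=8, vocab_size=10000, feat_dim=2048):
        super().__init__()
        self.region_proj = nn.Linear(feat_dim, d_model)  # detector features -> model dim
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=3, num_decoder_layers=3, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, caption_ids):
        src = self.region_proj(region_feats)   # (B, R, d): encoder reads image regions
        tgt = self.token_emb(caption_ids)      # (B, T, d): decoder reads the caption so far
        causal = self.transformer.generate_square_subsequent_mask(caption_ids.size(1))
        h = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(h)                     # next-token logits, (B, T, vocab)

captioner = RegionCaptioner()
feats = torch.randn(2, 36, 2048)               # 36 region features per image
caps = torch.randint(0, 10000, (2, 15))        # teacher-forced caption prefixes
logits = captioner(feats, caps)                # (2, 15, 10000)
```

The sketch shows only the teacher-forced training forward pass; at inference time the caption would be decoded autoregressively (e.g., greedy or beam search) by feeding back the predicted tokens.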