2022
DOI: 10.48550/arxiv.2206.01843
Preprint

Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning

Abstract: People say, "A picture is worth a thousand words". Then how can we get the rich information out of the image? We argue that by using visual clues to bridge large pretrained vision foundation models and language models, we can do so without any extra cross-modal training. Thanks to the strong zero-shot capability of foundation models, we start by constructing a rich semantic representation of the image (e.g., image tags, object attributes / locations, captions) as a structured textual prompt, called visual clue…
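The abstract's core step is serializing the outputs of vision foundation models into a structured textual prompt ("visual clues") that a language model can consume. The snippet below is a minimal sketch of that serialization, assuming hypothetical tag/object/caption outputs; the field names and template are illustrative placeholders, not the paper's actual prompt format.

```python
# Minimal sketch (not the paper's actual template): serialize hypothetical
# vision-model outputs (tags, object attributes/locations, caption candidates)
# into a structured textual prompt for a large language model.
def build_visual_clue_prompt(tags, objects, captions):
    lines = ["Image tags: " + ", ".join(tags)]
    for obj in objects:
        # Each object carries an attribute and a bounding box (illustrative keys).
        lines.append(f"Object: {obj['attribute']} {obj['name']} at box {obj['box']}")
    lines.append("Caption candidates: " + " / ".join(captions))
    lines.append("Describe this image in a detailed paragraph:")
    return "\n".join(lines)

prompt = build_visual_clue_prompt(
    tags=["beach", "dog", "frisbee"],
    objects=[{"name": "dog", "attribute": "brown", "box": (40, 60, 180, 200)}],
    captions=["a dog playing with a frisbee on the beach"],
)
print(prompt)
```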

Cited by 1 publication (1 citation statement)
References 25 publications (47 reference statements)
“…MAGIC (Su et al, 2022) uses a CLIP-induced score to regularize the language generation of GPT-2 so that the zero-shot generated caption is semantically related to the given image. BEST (Xie et al, 2022) uses the cooperation of Florence (Yuan et al, 2021) and GPT-3 for visual storytelling and image paragraph captioning. Wang et al (2022j) propose the cooperation of CLIP, BLIP (Li et al, 2022f), and GPT-3 for few-shot video-language learning.…”
Section: VLP for L Big Models
Citation type: mentioning (confidence: 99%)
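For readers unfamiliar with the CLIP-guided generation the statement refers to, the sketch below shows a simplified variant: re-ranking candidate captions (e.g., sampled from GPT-2) by CLIP image-text similarity. This is not MAGIC's exact token-level decoding objective; the model checkpoint, candidate strings, and image path are illustrative assumptions.

```python
# Simplified sketch of CLIP-based caption re-ranking (not MAGIC's exact
# token-level objective). Uses the open-source OpenAI CLIP package.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_scores(image_path, candidate_captions):
    """Return CLIP cosine similarity between the image and each candidate caption."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(candidate_captions).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).squeeze(0)

# Candidates would normally be sampled from a language model such as GPT-2;
# here they are placeholders, and "example.jpg" is an illustrative path.
candidates = ["a dog catching a frisbee on the beach", "a plate of pasta on a table"]
scores = clip_scores("example.jpg", candidates)
print(candidates[int(scores.argmax())])
```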