Linearly Mapping from Image to Text Space

Merullo, Jack; Castricato, Louis; Eickhoff, Carsten; Pavlick, Ellie

doi:10.48550/arxiv.2209.15162

Cited by 7 publications

(13 citation statements)

References 0 publications

“…Some works have also been proposed to answer this question. Merullo et al proposed a method [146] that injects a linear projection between the frozen image encoder and the text encoder. During training, only the linear projection is tuned.…”

Section: Vision Language Generation (mentioning)

confidence: 99%
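
The arrangement described in the statement above (frozen models on both sides, with only a linear projection being tuned) can be illustrated with a toy sketch. This is not the authors' implementation: the random "encoder", the hidden map that produces the targets, the dimensions, and the squared-error loss are all assumptions chosen to keep the example dependency-free. A real system would take its training signal from a frozen language model rather than from synthetic targets.

```python
import random

rng = random.Random(0)  # fixed seed so the toy run is repeatable

D_PIX, D_IMG, D_TXT = 6, 4, 3  # hypothetical sizes, chosen only for illustration


def rand_matrix(rows, cols):
    """Return a rows x cols matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Frozen stand-in for a pretrained image encoder (never updated below).
ENCODER = rand_matrix(D_IMG, D_PIX)
ENCODER_SNAPSHOT = [row[:] for row in ENCODER]

# Hidden map that generates toy "text-space" targets. This is an assumption
# of the sketch, standing in for the signal a frozen language model would give.
HIDDEN_MAP = rand_matrix(D_TXT, D_IMG)


def make_example():
    """Build one (image features, text-space target) pair."""
    pixels = [rng.uniform(-1.0, 1.0) for _ in range(D_PIX)]
    feats = matvec(ENCODER, pixels)
    return feats, matvec(HIDDEN_MAP, feats)


def mean_squared_error(proj, examples):
    """Average squared error of the projection over all examples."""
    total = 0.0
    for feats, target in examples:
        pred = matvec(proj, feats)
        total += sum((p - t) ** 2 for p, t in zip(pred, target)) / D_TXT
    return total / len(examples)


def train_projection(examples, steps=3000, lr=0.02):
    """Fit the linear projection by per-example gradient descent.

    The projection matrix is the only set of parameters that changes.
    """
    proj = [[0.0] * D_IMG for _ in range(D_TXT)]
    for step in range(steps):
        feats, target = examples[step % len(examples)]
        err = [p - t for p, t in zip(matvec(proj, feats), target)]
        for i in range(D_TXT):
            for j in range(D_IMG):
                # Gradient of 0.5 * ||proj @ feats - target||^2 w.r.t. proj[i][j].
                proj[i][j] -= lr * err[i] * feats[j]
    return proj


examples = [make_example() for _ in range(32)]
zero_proj = [[0.0] * D_IMG for _ in range(D_TXT)]
loss_before = mean_squared_error(zero_proj, examples)
projection = train_projection(examples)
loss_after = mean_squared_error(projection, examples)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

The point of the sketch is the parameter count: the trainable part is a single `D_TXT` by `D_IMG` matrix, while the encoder stays exactly as it was before training.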

A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT

Cao, Li, Liu et al. 2023

Preprint

Recently, ChatGPT, along with DALL-E-2 [1] and Codex [2], has been gaining significant attention from society. As a result, many individuals have become interested in related resources and are seeking to uncover the background and secrets behind its impressive performance. In fact, ChatGPT and other Generative AI (GAI) techniques belong to the category of Artificial Intelligence Generated Content (AIGC), which involves the creation of digital content, such as images, music, and natural language, through AI models. The goal of AIGC is to make the content creation process more efficient and accessible, allowing for the production of high-quality content at a faster pace. AIGC is achieved by extracting and understanding intent information from instructions provided by humans, and generating the content according to the model's knowledge and the intent information. In recent years, large-scale models have become increasingly important in AIGC as they provide better intent extraction and thus improved generation results. With the growth of data and the size of the models, the distribution that the model can learn becomes more comprehensive and closer to reality, leading to more realistic and high-quality content generation. This survey provides a comprehensive review of the history of generative models, their basic components, and recent advances in AIGC in both unimodal and multimodal interaction. From the perspective of unimodality, we introduce the generation tasks and related models of text and image. From the perspective of multimodality, we introduce the cross-application between the modalities mentioned above. Finally, we discuss the existing open problems and future challenges in AIGC.

“…Many recent works have transferred it on multiple downstream tasks, including semantic segmentation [8,24], object detection [5], Visual Question Answering [48] and image generation [33]. Many researchers regard CLIP as a pre-trained feature extractor [8,24,26,38,52]. [26,38] directly utilize CLIP as an image encoder to extract visual context.…”

Section: Vision-language Contrastive Learning (mentioning)

confidence: 99%

“…Many researchers regard CLIP as a pre-trained feature extractor [8,24,26,38,52]. [26,38] directly utilize CLIP as an image encoder to extract visual context. [8,24] employ CLIP to align the image with the target class for open-vocabulary tasks.…”

Section: Vision-language Contrastive Learning (mentioning)

confidence: 99%

“…As a large-scale pre-trained model, CLIP has been widely studied on downstream tasks. Existing CLIP-based works can be divided into two commonly used methods: fusion [8,24,26,38,52] and distillation methods [5,33,51,57]. On one hand, fusion methods regard CLIP as a pre-trained feature extractor and directly employ it for feature extraction (Fig.…”

Section: Introduction (mentioning)

confidence: 99%

Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition

Wang, Xie, Wang et al. 2023

Proceedings of the 31st ACM International Conference on Multimedia

In this paper, we explore the potential of the Contrastive Language-Image Pretraining (CLIP) model in scene text recognition (STR), and establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR) to leverage both visual and linguistic knowledge in CLIP. Different from previous CLIP-based methods mainly considering feature generalization on visual encoding, we propose a symmetrical distillation strategy (SDS) that further captures the linguistic knowledge in the CLIP text encoder. By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow that covers not only visual but also linguistic information for distillation. Benefiting from the natural alignment in CLIP, such guidance flow provides a progressive optimization objective from vision to language, which can supervise the STR feature forwarding process layer-by-layer. Besides, a new Linguistic Consistency Loss (LCL) is proposed to enhance the linguistic capability by considering second-order statistics during the optimization. Overall, CLIP-OCR is the first to design a smooth transition between image and text for the STR task. Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks. Code will be available at https://github.com/wzx99/CLIPOCR.

CCS Concepts: Applied computing → Optical character recognition.

“…MAPL keeps both the vision encoder and the LM frozen (thus further reducing the number of trainable parameters) and only learns a lightweight mapping network to connect both frozen models. Similar to MAPL, concurrent work LiMBeR (Merullo et al, 2022) also proposes to connect a frozen vision encoder with a frozen LM but using a linear mapping, which is not as parameter- and compute-efficient as MAPL (Sec. 4.5).…”

Section: Related Work (mentioning)

confidence: 99%

MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting

Mañas, Rodriguez Lopez, Ahmadi et al. 2023

Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. We propose MAPL, a simple and parameter-efficient method that reuses frozen pre-trained unimodal models and leverages their strong generalization capabilities in multimodal vision-language (VL) settings. MAPL learns a lightweight mapping between the representation spaces of unimodal models using aligned image-text data, and can generalize to unseen VL tasks from just a few in-context examples. The small number of trainable parameters makes MAPL effective at low-data and in-domain learning. Moreover, MAPL's modularity enables easy extension to other pretrained models. Extensive experiments on several visual question answering and image captioning benchmarks show that MAPL achieves superior or competitive performance compared to similar methods while training orders of magnitude fewer parameters. MAPL can be trained in just a few hours using modest computational resources and public datasets. We release our code and pre-trained model weights at https://github.com/oscmansan/mapl.
