Unleashing the Power of Visual Prompting At the Pixel Level

Wu, Junyang; Li, Xianhang; Chen, Wei; Wang, Huiyu; Yuille, Alan; Zhou, Yuyin; Xie, Cihang

doi:10.48550/arxiv.2212.10556

Cited by 4 publications

(8 citation statements)

References 0 publications

“…As the codebook of BEITv2 is distilled from CLIP, here we set some prompting methods designed on the image encoder of CLIP as baselines for direct comparisons. They are: a) finetuning CLIP; b) linear probing on CLIP; c) using textual prompt (TP); d) TP + visual prompt (VP) [1], which adds perturbation on the pixel; e) TP + PGN [38], which generates prompts for input; f) EVP [57], which adds prompts on the pixel with improved generalization; g) ILM-VP [6], which adds prompts on the pixel and learns a label mapping.…”

Section: Baseline Methods (mentioning)

confidence: 99%

“…Experimentally, VPTM outperforms other visual prompt learning methods [26,1,6,38,57] with better efficiency. Extensive experiments show the consistency between pretraining and downstream visual classification contributes to the robustness against learning strategies for different datasets, prompt locations, prompt length, and prototype dimensions.…”

Section: Introduction (mentioning)

confidence: 93%

“…Mid. Visual prompt methods designed on discriminative pre-trained models concentrate on adding prompts to input space (VPT [26], VP [1], ILM-VP [6], EVP [57]) or learning prompt network (PGN [38]), while ignoring task consistency. Bottom.…”

Section: Introduction (mentioning)

confidence: 99%

“…Visual prompting (VP) [1] modifies the pixel space with learnable parameters to perform visual prompt learning on CLIP [46], which has been pre-trained by contrastive learning. Till now, the current visual prompting methods [38,54,57,41] are all designed on discriminative pre-trained models shown in the middle of Fig. 1.…”

Section: Introduction (mentioning)

confidence: 99%
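
The statement above describes visual prompting as modifying the pixel space with learnable parameters. A minimal sketch of that idea follows, under stated assumptions: the border-frame layout, the 224×224 input size, the 30-pixel pad, and the names `border_mask` and `apply_pixel_prompt` are illustrative choices, not taken from [1] or from the citing paper. The prompt `delta` is left at zero here; in practice it would be trained by gradient descent while the pre-trained model stays frozen.

```python
import numpy as np


def border_mask(height, width, pad):
    """Return an (H, W, 1) mask: 1 on a `pad`-pixel frame, 0 in the interior."""
    mask = np.ones((height, width, 1), dtype=np.float32)
    mask[pad:height - pad, pad:width - pad, :] = 0.0
    return mask


def apply_pixel_prompt(image, delta, mask):
    """Add the prompt `delta` only where `mask` is 1, then clip to [0, 1]."""
    return np.clip(image + mask * delta, 0.0, 1.0)


# Hypothetical usage: one random image in [0, 1) and an untrained prompt.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)
delta = np.zeros((224, 224, 3), dtype=np.float32)  # the learnable parameters
mask = border_mask(224, 224, pad=30)
prompted = apply_pixel_prompt(image, delta, mask)  # this is what the frozen model would see
```

Restricting the prompt to a frame leaves the central image content untouched; an unmasked additive perturbation over the whole image is the other common variant.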

“…There lacks prompt learning method carefully designed for the generative pre-trained visual model. Particularly, regardless of the efforts paid on adding prompts in the input space [26,1,57,6], learning prompt networks [38] or designing prompt blocks [41], unifying the forms of pretraining and downstream applications by task reformulation to achieve consistency remains unexplored. In view of the improved performance, efficiency and stability brought by the task consistency of prompt learning in NLP, we aim at generative visual prompt learning by inheriting the generative pre-training task to achieve consistency.…”

Section: Introduction (mentioning)

confidence: 99%

Rethinking Visual Prompt Learning as Masked Visual Token Modeling

Li, Shi, Cao, et al. 2023

Preprint

Prompt learning has achieved great success in efficiently exploiting large-scale pre-trained models in natural language processing (NLP). It reformulates downstream tasks as generative pre-training tasks, thus narrowing the gap between them and stably improving performance. However, when prompt learning is transferred to vision, current visual prompt learning methods are all designed on discriminative pre-trained models, and careful design to unify the forms of pre-training and downstream tasks is also lacking. To explore prompt learning on a generative pre-trained visual model while keeping task consistency, we propose Visual Prompt learning as masked visual Token Modeling (VPTM), which transforms downstream visual classification into the pre-trained masked visual token prediction task. In addition, we develop a prototypical verbalizer for mapping the predicted visual token, which carries implicit semantics, to explicit downstream labels. To the best of our knowledge, VPTM is the first visual prompt method on a generative pre-trained visual model, and the first to achieve consistency between pre-training and downstream visual classification by task reformulation. Experiments show that VPTM outperforms other visual prompt methods and achieves excellent efficiency. Moreover, the task consistency of VPTM contributes to robustness against prompt location, prompt length, and prototype dimension, and allows VPTM to be deployed uniformly.

Are Vision Transformers Robust to Patch Perturbations?

Gu, Tresp, Qin 2022

Lecture Notes in Computer Science

Prompt engineering is a technique that involves augmenting a large pre-trained model with task-specific hints, known as prompts, to adapt the model to new tasks. Prompts can be created manually as natural language instructions or generated automatically as either natural language instructions or vector representations. Prompt engineering makes it possible to perform predictions based solely on prompts without updating model parameters, and eases the application of large pre-trained models to real-world tasks. In past years, prompt engineering has been well studied in natural language processing. Recently, it has also been intensively studied in vision-language modeling. However, there is currently no systematic overview of prompt engineering on pre-trained vision-language models. This paper aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models: multimodal-to-text generation models (e.g., Flamingo), image-text matching models (e.g., CLIP), and text-to-image generation models (e.g., Stable Diffusion). For each type of model, a brief model summary, prompting methods, prompting-based applications, and the corresponding responsibility and integrity issues are summarized and discussed. Furthermore, the commonalities and differences between prompting on vision-language models, language models, and vision models are also discussed. The challenges, future directions, and research opportunities are summarized to foster future research on this topic.
