2023
DOI: 10.48550/arxiv.2302.04237
Preprint

Adversarial Prompting for Black Box Foundation Models

Abstract: Prompting interfaces allow users to quickly adjust the output of generative models in both vision and language. However, small changes and design choices in the prompt can lead to significant differences in the output. In this work, we develop a black-box framework for generating adversarial prompts for unstructured image and text generation. These prompts, which can be standalone or prepended to benign prompts, induce specific behaviors into the generative process, such as generating images of a particular ob…
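
To make the high-level description concrete, the following is a minimal sketch of the general black-box setup the abstract describes: a short adversarial prefix is searched over using only model queries and an attack objective, then prepended to a benign prompt. The `query_model` and `target_score` functions, the toy vocabulary, and the greedy random search are illustrative assumptions, not the optimization method proposed in the paper.

```python
# Minimal sketch of black-box adversarial prompt search (illustrative only).
# query_model and target_score are hypothetical stand-ins for a real generative
# model API and an attack objective (e.g. similarity of the output to a target
# object, or perplexity of generated text); they are NOT from the paper.
import random
import string

VOCAB = list(string.ascii_lowercase)   # toy token alphabet
PROMPT_LEN = 8                         # number of adversarial tokens
BUDGET = 500                           # black-box query budget


def query_model(prompt: str) -> str:
    """Stand-in for the black-box generative model."""
    return prompt[::-1]                # placeholder behavior


def target_score(output: str) -> float:
    """Stand-in objective: higher means closer to the attacker's target behavior."""
    return float(sum(c == "a" for c in output))  # toy objective


def attack(benign_prompt: str) -> str:
    """Greedy random search over an adversarial prefix prepended to a benign prompt."""
    adv = [random.choice(VOCAB) for _ in range(PROMPT_LEN)]
    best = target_score(query_model("".join(adv) + " " + benign_prompt))
    for _ in range(BUDGET):
        cand = adv.copy()
        cand[random.randrange(PROMPT_LEN)] = random.choice(VOCAB)  # mutate one token
        score = target_score(query_model("".join(cand) + " " + benign_prompt))
        if score > best:               # keep the mutation only if the objective improves
            adv, best = cand, score
    return "".join(adv)


if __name__ == "__main__":
    # e.g. search for a prefix that steers generation from this benign prompt
    print(attack("a photo of a dog"))
```

The key property illustrated is that the attacker never needs gradients or model internals: only the ability to submit prompts and score the resulting outputs.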

Cited by 5 publications (6 citation statements)
References 22 publications

“…Recent literature has unveiled numerous failure modes, commonly termed as "jailbreaks," that circumvent the alignment mechanisms and safety guardrails implemented in modern LLMs [103], [4]. Among the identified jailbreaks, a notable category involves adversarial prompting, where an attacker manipulates prompts passed as input to a targeted LLM, tricking it into generating objectionable content [110], [111]. The detection of numerous jailbreak attacks in Large Language Models (LLMs) has paved the way for further enhancements to these models.…”
Section: Rise of MLLMs (mentioning)
confidence: 99%
“…In addition to adversaries in training data, prompts can also be attacked (Maus et al., 2023), which requires further knowledge and algorithms to deal with. This is currently a challenging problem due to the sensitivity of LLMs to prompting.…”
Section: Adversarial Attack Remains a Major Threat (mentioning)
confidence: 99%
“…However, there has been limited research on vulnerabilities specifically related to targeted image generation. Although a few methods [20,21,24] attack Stable Diffusion to generate specific images, they usually explore fabricated words, which makes a comprehensive analysis of model vulnerabilities difficult.…”
Section: Vulnerabilities in Stable Diffusion (mentioning)
confidence: 99%
“…However, a notable gap in these methods is their limited capacity to uncover vulnerabilities associated with the covert generation of targeted images. Although there are a few targeted attack methods [20,21,24] for generating a specific image, they usually explore fabricated adversarial words like 'Napatree' and 'uccoisegeljaros' that are easily detectable, which makes a comprehensive analysis of model vulnerabilities difficult.…”
Section: Introduction (mentioning)
confidence: 99%