2023
DOI: 10.48550/arxiv.2302.04237
Preprint

Adversarial Prompting for Black Box Foundation Models

Abstract: Prompting interfaces allow users to quickly adjust the output of generative models in both vision and language. However, small changes and design choices in the prompt can lead to significant differences in the output. In this work, we develop a black-box framework for generating adversarial prompts for unstructured image and text generation. These prompts, which can be standalone or prepended to benign prompts, induce specific behaviors into the generative process, such as generating images of a particular ob…
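
To make the high-level description concrete, the following is a minimal sketch of the general black-box setup the abstract describes: a short adversarial prefix is searched over using only model queries and an attack objective, then prepended to a benign prompt. The `query_model` and `target_score` functions, the toy vocabulary, and the greedy random search are illustrative assumptions, not the optimization method proposed in the paper.

```python
# Minimal sketch of black-box adversarial prompt search (illustrative only).
# query_model and target_score are hypothetical stand-ins for a real generative
# model API and an attack objective (e.g. similarity of the output to a target
# object, or perplexity of generated text); they are NOT from the paper.
import random
import string

VOCAB = list(string.ascii_lowercase)   # toy token alphabet
PROMPT_LEN = 8                         # number of adversarial tokens
BUDGET = 500                           # black-box query budget


def query_model(prompt: str) -> str:
    """Stand-in for the black-box generative model."""
    return prompt[::-1]                # placeholder behavior


def target_score(output: str) -> float:
    """Stand-in objective: higher means closer to the attacker's target behavior."""
    return float(sum(c == "a" for c in output))  # toy objective


def attack(benign_prompt: str) -> str:
    """Greedy random search over an adversarial prefix prepended to a benign prompt."""
    adv = [random.choice(VOCAB) for _ in range(PROMPT_LEN)]
    best = target_score(query_model("".join(adv) + " " + benign_prompt))
    for _ in range(BUDGET):
        cand = adv.copy()
        cand[random.randrange(PROMPT_LEN)] = random.choice(VOCAB)  # mutate one token
        score = target_score(query_model("".join(cand) + " " + benign_prompt))
        if score > best:               # keep the mutation only if the objective improves
            adv, best = cand, score
    return "".join(adv)


if __name__ == "__main__":
    # e.g. search for a prefix that steers generation from this benign prompt
    print(attack("a photo of a dog"))
```

The key property illustrated is that the attacker never needs gradients or model internals: only the ability to submit prompts and score the resulting outputs.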

Cited by 5 publications (6 citation statements)
References 22 publications

“…Recent literature has unveiled numerous failure modes, commonly termed as "jailbreaks," that circumvent the alignment mechanisms and safety guardrails implemented in modern LLMs [103], [4]. Among the identified jailbreaks, a notable category involves adversarial prompting, where an attacker manipulates prompts passed as input to a targeted LLM, tricking it into generating objectionable content [110], [111]. The detection of numerous jailbreak attacks in Large Language Models (LLMs) has paved the way for further enhancements to these models.…”
Section: Rise of MLLMs (mentioning)
confidence: 99%
“…In addition to adversaries in training data, prompts can also be attacked (Maus et al., 2023), which requires further knowledge and algorithms to deal with. This is currently a challenging problem due to the sensitivity of LLMs to prompting.…”
Section: Adversarial Attack Remains a Major Threat (mentioning)
confidence: 99%
“…However, there has been limited research on vulnerabilities specifically related to targeted image generation. Although a few methods [20,21,24] attack Stable Diffusion to generate specific images, they usually explore fabricated words, which makes a comprehensive analysis of model vulnerabilities difficult.…”
Section: Vulnerabilities in Stable Diffusion (mentioning)
confidence: 99%
“…However, a notable gap in these methods is their limited capacity to uncover vulnerabilities associated with the covert generation of targeted images. Although there are a few targeted attack methods [20,21,24] for generating a specific image, they usually explore fabricated adversarial words like 'Napatree' and 'uccoisegeljaros' that are easily detectable, which makes a comprehensive analysis of model vulnerabilities difficult.…”
Section: Introduction (mentioning)
confidence: 99%