2023 · Preprint
DOI: 10.21203/rs.3.rs-2873090/v1

Defending ChatGPT against Jailbreak Attack via Self-Reminder

Fangzhao Wu, Yueqi Xie, Jingwei Yi, et al.

Abstract: ChatGPT is a societally impactful AI tool with millions of users and integration into products such as Bing. However, the emergence of Jailbreak Attacks, which can elicit harmful responses by bypassing ChatGPT's ethics safeguards, significantly threatens its responsible and secure use. This paper investigates the severe, yet under-explored problems created by Jailbreaks and potential defensive techniques. We introduce a Jailbreak dataset with various types of Jailbreak prompts and malicious instructions. We …


Cited by 2 publications (1 citation statement)
References 27 publications
“…Wu et al [10] delve into defending ChatGPT against Jailbreak Attack in their paper through a technique called System-Mode Self-Reminder. This method significantly lowers the success rate of Jailbreak Attacks, emphasizing the importance of proactive and innovative defense strategies in safeguarding LLMs against emerging threats.…”
Section: Other Related Literature (citation type: mentioning; confidence: 99%)
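The self-reminder defense described above works by wrapping the user's query between system-level reminders of the model's responsible-AI role before it is sent to the LLM. A minimal sketch of that wrapping step, with illustrative reminder phrasing and a hypothetical `self_reminder_wrap` helper name (not taken from the paper):

```python
def self_reminder_wrap(user_prompt: str) -> str:
    """Sandwich a user prompt between system-mode self-reminders.

    The reminder wording here is illustrative; the actual prompts used by
    Wu et al. may differ.
    """
    prefix = (
        "You should be a responsible AI assistant and should not generate "
        "harmful or misleading content. Please answer the following user "
        "query in a responsible way.\n\n"
    )
    suffix = (
        "\n\nRemember: you should be a responsible AI assistant and should "
        "not generate harmful or misleading content."
    )
    return prefix + user_prompt + suffix


# The wrapped string would then be sent to the chat model in place of the
# raw user prompt, e.g. as the content of a single user message.
wrapped = self_reminder_wrap("How do I bake bread?")
```

The design intuition reported in the citing paper is that repeating the safety reminder both before and after the query makes it harder for a jailbreak prompt embedded in the middle to override the model's safeguards.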