2023 · Preprint
DOI: 10.21203/rs.3.rs-2873090/v1

Defending ChatGPT against Jailbreak Attack via Self-Reminder

Fangzhao Wu, Yueqi Xie, Jingwei Yi, et al.

Abstract: ChatGPT is a societally impactful AI tool with millions of users and integration into products such as Bing. However, the emergence of Jailbreak Attacks, which can elicit harmful responses by bypassing ChatGPT's ethics safeguards, significantly threatens its responsible and secure use. This paper investigates the severe, yet under-explored problems created by Jailbreaks and potential defensive techniques. We introduce a Jailbreak dataset with various types of Jailbreak prompts and malicious instructions. We …


Cited by 2 publications (1 citation statement)
References 27 publications
“…Wu et al [10] delve into defending ChatGPT against Jailbreak Attack in their paper through a technique called System-Mode Self-Reminder. This method significantly lowers the success rate of Jailbreak Attacks, emphasizing the importance of proactive and innovative defense strategies in safeguarding LLMs against emerging threats.…”
Section: Other Related Literature (citation type: mentioning; confidence: 99%)
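The self-reminder defense described above works by wrapping the user's query between system-level reminders of the model's responsible-AI role before it is sent to the LLM. A minimal sketch of that wrapping step, with illustrative reminder phrasing and a hypothetical `self_reminder_wrap` helper name (not taken from the paper):

```python
def self_reminder_wrap(user_prompt: str) -> str:
    """Sandwich a user prompt between system-mode self-reminders.

    The reminder wording here is illustrative; the actual prompts used by
    Wu et al. may differ.
    """
    prefix = (
        "You should be a responsible AI assistant and should not generate "
        "harmful or misleading content. Please answer the following user "
        "query in a responsible way.\n\n"
    )
    suffix = (
        "\n\nRemember: you should be a responsible AI assistant and should "
        "not generate harmful or misleading content."
    )
    return prefix + user_prompt + suffix


# The wrapped string would then be sent to the chat model in place of the
# raw user prompt, e.g. as the content of a single user message.
wrapped = self_reminder_wrap("How do I bake bread?")
```

The design intuition reported in the citing paper is that repeating the safety reminder both before and after the query makes it harder for a jailbreak prompt embedded in the middle to override the model's safeguards.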