2023
DOI: 10.48550/arxiv.2302.07459
Preprint
The Capacity for Moral Self-Correction in Large Language Models

Abstract: We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" (to avoid producing harmful outputs) if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training.
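
The intervention described in the abstract is, at its core, an instruction prepended to the prompt. A minimal sketch of that setup, assuming a hypothetical `complete` callable standing in for an RLHF-trained chat model; the instruction wording is illustrative, not necessarily the paper's exact prompt:

```python
from typing import Callable

# Hypothetical stand-in for any RLHF-trained chat model's completion function.
CompletionFn = Callable[[str], str]

# Illustrative instruction; the paper tests instructions of this kind,
# not necessarily this exact wording.
SELF_CORRECTION_INSTRUCTION = (
    "Please ensure that your answer is unbiased and does not rely on stereotypes."
)

def answer(question: str, complete: CompletionFn, self_correct: bool = True) -> str:
    """Query the model, optionally prepending a moral self-correction instruction."""
    prompt = f"{SELF_CORRECTION_INSTRUCTION}\n\n{question}" if self_correct else question
    return complete(prompt)
```

Comparing outputs with `self_correct=True` against `self_correct=False` is the basic contrast the paper's experiments measure across model sizes.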

Cited by 21 publications (29 citation statements)
References 35 publications (60 reference statements)
“…The analysis, however, points to a contradiction in the literature. Contrary to some promising attempts (e.g., Ganguli et al., 2023) that argue language models can engage in self-correction, other recent studies (e.g., Gregorcic & Pendrill, 2023) do not confirm this capability. Gregorcic and Pendrill (2023) engaged in a Socratic dialogue with ChatGPT to fix the errors and contradictions in ChatGPT's responses to their question.…”
Section: Discussion (contrasting)
Confidence: 58%
“…Step 4: Relationship Discrimination. Relying on the self-correction capability of the LLM (Ganguli et al., 2023), we set up an LLM as a scoring agent. We provide the original context and all triplet relations generated in Step 3 to the agent.…”
Section: Methods (mentioning)
Confidence: 99%
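
The step quoted above uses an LLM as a scoring agent that judges extracted triplets against the original context. A rough sketch of how such an agent could be wired up; `call_llm`, the prompt wording, and the 0-10 scale are assumptions for illustration, not the cited paper's implementation:

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)

SCORING_PROMPT = """You are a scoring agent. Rate how well each \
(subject, relation, object) triplet is supported by the context, on a 0-10 scale.

Context:
{context}

Triplets:
{triplets}

Answer with one line per triplet in the form "<index>: <score>"."""

def score_triplets(
    context: str,
    triplets: List[Triplet],
    call_llm: Callable[[str], str],  # Hypothetical LLM completion function.
) -> str:
    """Ask an LLM to discriminate well-supported triplet relations from spurious ones."""
    listing = "\n".join(f"{i}: {t}" for i, t in enumerate(triplets))
    prompt = SCORING_PROMPT.format(context=context, triplets=listing)
    return call_llm(prompt)  # Raw score lines; parsing and thresholding are left to the caller.
```

Constraining the agent's output to a fixed "<index>: <score>" format keeps the downstream parsing and thresholding trivial.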
“…In this work, we aim to apply this psychological self-improvement strategy for human behavior to the behavior of LLMs. Second, the emerging abilities of LLMs to perform self-validation and self-correction, as demonstrated in recent studies [30][31][32], suggest the possibility of addressing this challenging problem using ChatGPT itself. Third, we draw inspiration from existing jailbreaks, many of which bypass ChatGPT's moral alignment by guiding it into certain uncontrollable "modes" that then generate harmful responses.…”
Section: Toxic (mentioning)
Confidence: 99%
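
The self-validation and self-correction abilities mentioned here are commonly operationalized as a critique-then-revise loop. A minimal sketch under that assumption, with `call_llm` again a hypothetical stand-in and the "OK" stopping criterion chosen for illustration rather than taken from the cited studies:

```python
from typing import Callable

def self_correct(task: str, call_llm: Callable[[str], str], max_rounds: int = 2) -> str:
    """Draft an answer, ask the model to critique it, and revise until it passes."""
    draft = call_llm(task)
    for _ in range(max_rounds):
        critique = call_llm(
            f"Task: {task}\nDraft answer: {draft}\n"
            "Point out any harmful, biased, or incorrect content in the draft. "
            "Reply with just 'OK' if there is none."
        )
        if critique.strip().upper().rstrip(".") == "OK":
            break  # The model validates its own draft; stop revising.
        draft = call_llm(
            f"Task: {task}\nDraft answer: {draft}\nCritique: {critique}\n"
            "Rewrite the answer so that it addresses the critique."
        )
    return draft
```

The same loop structure works whether the validation target is toxicity, bias, or factual error; only the critique prompt changes.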
“…Recent studies have been exploring the capacity of large language models to validate and correct their own claims [30][31][32]. For instance, the prior work [31] investigates the ability of language models to evaluate the validity of their claims and predict their ability to answer questions, while the recent study [30] demonstrates the capacity of LLMs for moral self-correction.…”
Section: Related Work (mentioning)
Confidence: 99%