2023
DOI: 10.48550/arxiv.2302.07459
Preprint
The Capacity for Moral Self-Correction in Large Language Models

Abstract: We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" (to avoid producing harmful outputs) if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training.
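
The intervention described in the abstract is, at its core, an instruction prepended to the prompt. A minimal sketch of that setup, assuming a hypothetical `complete` callable standing in for an RLHF-trained chat model; the instruction wording is illustrative, not necessarily the paper's exact prompt:

```python
from typing import Callable

# Hypothetical stand-in for any RLHF-trained chat model's completion function.
CompletionFn = Callable[[str], str]

# Illustrative instruction; the paper tests instructions of this kind,
# not necessarily this exact wording.
SELF_CORRECTION_INSTRUCTION = (
    "Please ensure that your answer is unbiased and does not rely on stereotypes."
)

def answer(question: str, complete: CompletionFn, self_correct: bool = True) -> str:
    """Query the model, optionally prepending a moral self-correction instruction."""
    prompt = f"{SELF_CORRECTION_INSTRUCTION}\n\n{question}" if self_correct else question
    return complete(prompt)
```

Comparing outputs with `self_correct=True` against `self_correct=False` is the basic contrast the paper's experiments measure across model sizes.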

Cited by 21 publications (29 citation statements)
References 35 publications (60 reference statements)
“…The analysis, however, points to a contradiction in the literature. Contrary to some promising attempts (e.g., Ganguli et al., 2023) that argue language models can engage in self-correction, other recent studies (e.g., Gregorcic & Pendrill, 2023) do not confirm this capability. Gregorcic and Pendrill (2023) engaged in a Socratic dialogue with ChatGPT to fix the errors and contradictions in ChatGPT's responses to their question.…”
Section: Discussion (contrasting)
Confidence: 58%
“…Step 4: Relationship Discrimination. Relying on the self-correction capability of the LLM (Ganguli et al., 2023), we set up an LLM as a scoring agent. We provide the original context and all triplet relations generated in Step 3 to the agent.…”
Section: Methods (mentioning)
Confidence: 99%
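
The step quoted above uses an LLM as a scoring agent that judges extracted triplets against the original context. A rough sketch of how such an agent could be wired up; `call_llm`, the prompt wording, and the 0-10 scale are assumptions for illustration, not the cited paper's implementation:

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)

SCORING_PROMPT = """You are a scoring agent. Rate how well each \
(subject, relation, object) triplet is supported by the context, on a 0-10 scale.

Context:
{context}

Triplets:
{triplets}

Answer with one line per triplet in the form "<index>: <score>"."""

def score_triplets(
    context: str,
    triplets: List[Triplet],
    call_llm: Callable[[str], str],  # Hypothetical LLM completion function.
) -> str:
    """Ask an LLM to discriminate well-supported triplet relations from spurious ones."""
    listing = "\n".join(f"{i}: {t}" for i, t in enumerate(triplets))
    prompt = SCORING_PROMPT.format(context=context, triplets=listing)
    return call_llm(prompt)  # Raw score lines; parsing and thresholding are left to the caller.
```

Constraining the agent's output to a fixed "<index>: <score>" format keeps the downstream parsing and thresholding trivial.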
“…In this work, we aim to apply this psychological self-improvement strategy for human behavior to the behavior of LLMs. Second, the emerging abilities of LLMs to perform self-validation and self-correction, as demonstrated in recent studies [30][31][32], suggest the possibility of addressing this challenging problem using ChatGPT itself. Third, we draw inspiration from existing jailbreaks, many of which bypass ChatGPT's moral alignment by guiding it into certain uncontrollable "modes" that then generate harmful responses.…”
Section: Toxic (mentioning)
Confidence: 99%
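
The self-validation and self-correction abilities mentioned here are commonly operationalized as a critique-then-revise loop. A minimal sketch under that assumption, with `call_llm` again a hypothetical stand-in and the "OK" stopping criterion chosen for illustration rather than taken from the cited studies:

```python
from typing import Callable

def self_correct(task: str, call_llm: Callable[[str], str], max_rounds: int = 2) -> str:
    """Draft an answer, ask the model to critique it, and revise until it passes."""
    draft = call_llm(task)
    for _ in range(max_rounds):
        critique = call_llm(
            f"Task: {task}\nDraft answer: {draft}\n"
            "Point out any harmful, biased, or incorrect content in the draft. "
            "Reply with just 'OK' if there is none."
        )
        if critique.strip().upper().rstrip(".") == "OK":
            break  # The model validates its own draft; stop revising.
        draft = call_llm(
            f"Task: {task}\nDraft answer: {draft}\nCritique: {critique}\n"
            "Rewrite the answer so that it addresses the critique."
        )
    return draft
```

The same loop structure works whether the validation target is toxicity, bias, or factual error; only the critique prompt changes.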
“…Recent studies have been exploring the capacity of large language models to validate and correct their own claims [30][31][32]. For instance, the prior work [31] investigates the ability of language models to evaluate the validity of their claims and predict their ability to answer questions, while the recent study [30] demonstrates the capacity of LLMs for moral self-correction.…”
Section: Related Work (mentioning)
Confidence: 99%