2022
DOI: 10.48550/arxiv.2209.07858
Preprint

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Abstract: We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from…
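
The "LM with rejection sampling" in the abstract refers to drawing several candidate completions and returning the one a preference model rates as least harmful. The sketch below is a minimal, hypothetical illustration of that idea only; sample_from_lm and harmlessness_score are stand-in stubs, not the models or interfaces used in the paper.

```python
# Hypothetical sketch of rejection sampling against a harmlessness score.
# `sample_from_lm` and `harmlessness_score` are illustrative stubs.
import random

def sample_from_lm(prompt: str, k: int) -> list[str]:
    # Stand-in for drawing k samples from a language model.
    return [f"candidate response {i} to: {prompt}" for i in range(k)]

def harmlessness_score(prompt: str, response: str) -> float:
    # Stand-in for a preference/harmlessness model's scalar score.
    return random.random()

def rejection_sample(prompt: str, k: int = 16) -> str:
    """Draw k candidates and return the one the scorer rates most harmless."""
    candidates = sample_from_lm(prompt, k)
    return max(candidates, key=lambda r: harmlessness_score(prompt, r))

print(rejection_sample("Tell me about your capabilities."))
```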

Cited by 32 publications (50 citation statements)
References 49 publications

“…There has been a large amount of empirical work that demonstrates the success of MLE and pessimistic MLE in RLHF for game playing (Knox and Stone, 2008; MacGlashan et al, 2017; Christiano et al, 2017a; Warnell et al, 2018), robotics (Brown et al, 2019; Shin et al, 2023), and language models (Ziegler et al, 2019; Stiennon et al, 2020; Nakano et al, 2021; Ouyang et al, 2022; Menick et al, 2022; Glaese et al, 2022; Gao et al, 2022; Bai et al, 2022a; Ganguli et al, 2022; Ramamurthy et al, 2022). Notably, the concurrent work Shin et al (2023) proposes Offline Preference-Based Reward Learning (OPRL), which trains a pessimistic policy from the learned reward and empirically shows the superior performance of the pessimism-based method (which can be viewed as an approximation of pessimistic MLE).…”
Section: Methods (mentioning)
confidence: 99%
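The MLE approach referenced in the statement above fits a reward model to pairwise human preferences, typically under a Bradley-Terry assumption. The following toy sketch is a hypothetical illustration, not code from any of the cited works: it assumes a scalar reward model over fixed-size features and a batch of (preferred, rejected) pairs, and maximizes the log-likelihood that the preferred response scores higher.

```python
# Minimal sketch of Bradley-Terry maximum-likelihood reward learning from
# pairwise preferences. All names and data here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy scalar reward model over fixed-size feature vectors."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x).squeeze(-1)  # shape: (batch,)

def bradley_terry_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood of P(preferred > rejected) = sigmoid(r_pref - r_rej)
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# Toy training loop on random features standing in for response embeddings.
dim = 16
model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
preferred = torch.randn(256, dim)
rejected = torch.randn(256, dim)
for step in range(100):
    loss = bradley_terry_loss(model(preferred), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

A pessimistic variant, as described in the citing work, would additionally penalize reward estimates with high uncertainty before policy optimization; that step is omitted here.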
“…One of the most promising tools for AI alignment, Reinforcement Learning with Human Feedback (RLHF, or Preference-based Reinforcement Learning), has delivered significant empirical success in the fields of game playing, robot training, stock prediction, recommender systems, clinical trials, large language models, etc. (Novoseller et al, 2019; Sadigh et al, 2017; Christiano et al, 2017b; Kupcsik et al, 2018; Jain et al, 2013; Wirth et al, 2017; Knox and Stone, 2008; MacGlashan et al, 2017; Christiano et al, 2017a; Warnell et al, 2018; Brown et al, 2019; Shin et al, 2023; Ziegler et al, 2019; Stiennon et al, 2020; Nakano et al, 2021; Ouyang et al, 2022; Menick et al, 2022; Glaese et al, 2022; Gao et al, 2022; Bai et al, 2022a; Ganguli et al, 2022; Ramamurthy et al, 2022). Notably, the language model application ChatGPT is based on RLHF, and this underlies several of its skills: answering follow-up questions, admitting its mistakes, challenging incorrect premises, and rejecting inappropriate requests.…”
Section: Introduction (mentioning)
confidence: 99%
“…In the absence of such methods, language models are known to demonstrate toxic/harmful behaviour (Sheng et al, 2019; Liang et al, 2021; Wallace et al, 2019), generate non-factual information (Maynez et al, 2020; Longpre et al, 2021; Devaraj et al, 2022), and present other challenges in deployment and evaluation (Zellers et al, 2019; McGuffie and Newhouse, 2020; Talat et al, 2022). Analyzing, evaluating, and mitigating these problems poses a promising direction for future work (Gao et al, 2022; Ganguli et al, 2022). Instruction tuning warrants greater investigation, as it has already demonstrated itself to be an encouraging remedy for reducing NLP bias metrics, as shown in .…”
Section: Problems Addressed By Instruction Tuning and Alignment Techn... (mentioning)
confidence: 99%
“…Also, some cases (e.g., interactive conversation) are well suited to incorporating models into data construction. Therefore, many works have models cooperate with human annotators to construct data [6,40,51,60,115,170]. It is worth noting that large pretrained language models such as GPT-3 [17] play a key role in generating new samples through zero-shot or few-shot prompting [51,60].…”
Section: Detection (mentioning)
confidence: 99%
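As a concrete, hypothetical illustration of the few-shot prompting workflow this statement describes, the sketch below assembles a prompt from human-written seed examples and passes it to a stand-in generate function. The prompt template, seed examples, and generate stub are assumptions for illustration, not any cited system's actual data or interface.

```python
# Hypothetical sketch of few-shot prompting to generate new samples from
# a handful of human-written seed examples. `generate` is a stand-in stub,
# not a real model API.
SEED_EXAMPLES = [
    "How do I politely decline a meeting invitation?",
    "Can you explain what a hash table is?",
    "What should I consider before adopting a dog?",
]

def build_few_shot_prompt(seeds: list[str], n_new: int) -> str:
    # Number the seed examples, then ask the model to continue the list.
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(seeds))
    return (
        "Here are some example user questions:\n"
        f"{numbered}\n"
        f"Write {n_new} more questions in the same style:\n"
        f"{len(seeds) + 1}."
    )

def generate(prompt: str) -> str:
    # Stand-in for a call to a large language model.
    return " What is a good way to learn a new language?"

prompt = build_few_shot_prompt(SEED_EXAMPLES, n_new=1)
print(prompt + generate(prompt))
```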
“…In the second part of this survey, we summarize evaluation methods for an integrated dialogue system, which include not only traditional safety detection, i.e., case-level toxicity detection, but also system-level safety checks on the integrated system, such as group preference, values preference, morality, etc. Meanwhile, we survey approaches to better elicit the potential safety issues of dialogue systems, i.e., red teaming [40,100].…”
Section: Introduction (mentioning)
confidence: 99%