2022
DOI: 10.48550/arxiv.2209.07858
Preprint

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Abstract: We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from…
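
The "LM with rejection sampling" in the abstract refers to drawing several candidate completions and returning the one a preference model rates as least harmful. The sketch below is a minimal, hypothetical illustration of that idea only; sample_from_lm and harmlessness_score are stand-in stubs, not the models or interfaces used in the paper.

```python
# Hypothetical sketch of rejection sampling against a harmlessness score.
# `sample_from_lm` and `harmlessness_score` are illustrative stubs.
import random

def sample_from_lm(prompt: str, k: int) -> list[str]:
    # Stand-in for drawing k samples from a language model.
    return [f"candidate response {i} to: {prompt}" for i in range(k)]

def harmlessness_score(prompt: str, response: str) -> float:
    # Stand-in for a preference/harmlessness model's scalar score.
    return random.random()

def rejection_sample(prompt: str, k: int = 16) -> str:
    """Draw k candidates and return the one the scorer rates most harmless."""
    candidates = sample_from_lm(prompt, k)
    return max(candidates, key=lambda r: harmlessness_score(prompt, r))

print(rejection_sample("Tell me about your capabilities."))
```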

Cited by 32 publications (50 citation statements)
References 49 publications

“…There has been a large amount of empirical work that demonstrates the success of MLE and pessimistic MLE in RLHF for game playing (Knox and Stone, 2008; MacGlashan et al, 2017; Christiano et al, 2017a; Warnell et al, 2018), robotics (Brown et al, 2019; Shin et al, 2023), and language models (Ziegler et al, 2019; Stiennon et al, 2020; Nakano et al, 2021; Ouyang et al, 2022; Menick et al, 2022; Glaese et al, 2022; Gao et al, 2022; Bai et al, 2022a; Ganguli et al, 2022; Ramamurthy et al, 2022). Notably, the concurrent work Shin et al (2023) proposes Offline Preference-Based Reward Learning (OPRL), which trains a pessimistic policy from the learned reward and empirically shows the superior performance of the pessimism-based method (which can be viewed as an approximation of pessimistic MLE).…”
Section: Methods (mentioning)
confidence: 99%
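The MLE approach referenced in the statement above fits a reward model to pairwise human preferences, typically under a Bradley-Terry assumption. The following toy sketch is a hypothetical illustration, not code from any of the cited works: it assumes a scalar reward model over fixed-size features and a batch of (preferred, rejected) pairs, and maximizes the log-likelihood that the preferred response scores higher.

```python
# Minimal sketch of Bradley-Terry maximum-likelihood reward learning from
# pairwise preferences. All names and data here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy scalar reward model over fixed-size feature vectors."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x).squeeze(-1)  # shape: (batch,)

def bradley_terry_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood of P(preferred > rejected) = sigmoid(r_pref - r_rej)
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# Toy training loop on random features standing in for response embeddings.
dim = 16
model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
preferred = torch.randn(256, dim)
rejected = torch.randn(256, dim)
for step in range(100):
    loss = bradley_terry_loss(model(preferred), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

A pessimistic variant, as described in the citing work, would additionally penalize reward estimates with high uncertainty before policy optimization; that step is omitted here.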
“…One of the most promising tools for AI alignment, Reinforcement Learning with Human Feedback (RLHF, or Preference-based Reinforcement Learning), has delivered significant empirical success in the fields of game playing, robot training, stock prediction, recommender systems, clinical trials, large language models, etc. (Novoseller et al, 2019; Sadigh et al, 2017; Christiano et al, 2017b; Kupcsik et al, 2018; Jain et al, 2013; Wirth et al, 2017; Knox and Stone, 2008; MacGlashan et al, 2017; Christiano et al, 2017a; Warnell et al, 2018; Brown et al, 2019; Shin et al, 2023; Ziegler et al, 2019; Stiennon et al, 2020; Nakano et al, 2021; Ouyang et al, 2022; Menick et al, 2022; Glaese et al, 2022; Gao et al, 2022; Bai et al, 2022a; Ganguli et al, 2022; Ramamurthy et al, 2022). Notably, the language model application ChatGPT is based on RLHF, and this underlies several of its skills: answering follow-up questions, admitting its mistakes, challenging incorrect premises, and rejecting inappropriate requests.…”
Section: Introduction (mentioning)
confidence: 99%
“…In the absence of such methods, language models are known to demonstrate toxic/harmful behaviour (Sheng et al, 2019; Liang et al, 2021; Wallace et al, 2019), generate non-factual information (Maynez et al, 2020; Longpre et al, 2021; Devaraj et al, 2022), and present other challenges in deployment and evaluation (Zellers et al, 2019; McGuffie and Newhouse, 2020; Talat et al, 2022). Analyzing, evaluating, and mitigating these problems poses a promising direction for future work (Gao et al, 2022; Ganguli et al, 2022). Instruction tuning warrants greater investigation, as it has already demonstrated itself to be an encouraging remedy for reducing NLP bias metrics, as shown in .…”
Section: Problems Addressed By Instruction Tuning and Alignment Techn... (mentioning)
confidence: 99%
“…Also, some cases (e.g., interactive conversation) are well suited to incorporating models into data construction. Therefore, many works have models cooperate with human annotators to construct data [6,40,51,60,115,170]. It is worth noting that large pretrained language models such as GPT-3 [17] play a key role in generating new samples through zero-shot or few-shot prompting [51,60].…”
Section: Detection (mentioning)
confidence: 99%
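As a concrete, hypothetical illustration of the few-shot prompting workflow this statement describes, the sketch below assembles a prompt from human-written seed examples and passes it to a stand-in generate function. The prompt template, seed examples, and generate stub are assumptions for illustration, not any cited system's actual data or interface.

```python
# Hypothetical sketch of few-shot prompting to generate new samples from
# a handful of human-written seed examples. `generate` is a stand-in stub,
# not a real model API.
SEED_EXAMPLES = [
    "How do I politely decline a meeting invitation?",
    "Can you explain what a hash table is?",
    "What should I consider before adopting a dog?",
]

def build_few_shot_prompt(seeds: list[str], n_new: int) -> str:
    # Number the seed examples, then ask the model to continue the list.
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(seeds))
    return (
        "Here are some example user questions:\n"
        f"{numbered}\n"
        f"Write {n_new} more questions in the same style:\n"
        f"{len(seeds) + 1}."
    )

def generate(prompt: str) -> str:
    # Stand-in for a call to a large language model.
    return " What is a good way to learn a new language?"

prompt = build_few_shot_prompt(SEED_EXAMPLES, n_new=1)
print(prompt + generate(prompt))
```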
“…In the second part of this survey, we summarize evaluation methods for an integrated dialogue system, which include not only traditional safety detection, i.e., case-level toxicity detection, but also system-level safety checks on the integrated system, such as group preference, values preference, morality, etc. Meanwhile, we survey approaches to better elicit the potential safety issues of dialogue systems, i.e., red teaming [40,100].…”
Section: Introduction (mentioning)
confidence: 99%