2021
DOI: 10.48550/arxiv.2112.04359
Preprint

Ethical and social risks of harm from Language Models

Abstract: This paper aims to help structure the risk landscape associated with large-scale Language Models (LMs). In order to foster advances in responsible innovation, an in-depth understanding of the potential risks posed by these models is needed. A wide range of established and anticipated risks are analysed in detail, drawing on multidisciplinary literature from computer science, linguistics, and social sciences.

Cited by 105 publications (120 citation statements). References 158 publications (233 reference statements).
“…As described in Section 2, open-endedness, combined with smooth general capability scaling and the abrupt scaling of specific capabilities, is likely to lead to safety issues [72, 9] that are found after a model has been developed and deployed. Additionally, these models possess known (pre-deployment) safety issues for which we lack robust solutions [33] (e.g., how do you ensure the system does not generate inappropriate and harmful outputs, such as making overtly sexist or racist comments [65]?)…”
Section: Safety (mentioning)
confidence: 99%
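
The deployment-time concern raised in this citation statement is commonly addressed by screening model outputs before they reach users. Below is a minimal Python sketch of that pattern under stated assumptions: `generate` and `toxicity_score` are hypothetical stand-ins (a toy keyword check here), not any specific model's or classifier's API.

```python
# Minimal sketch of post-hoc output screening, one common (and imperfect)
# mitigation for deployment-time safety issues. All names are hypothetical.

def generate(prompt: str) -> str:
    """Stand-in for a language model completion call."""
    return "An example completion for: " + prompt

def toxicity_score(text: str) -> float:
    """Toy scorer: in practice this would be a trained toxicity
    classifier, not a keyword blocklist."""
    blocklist = {"slur_a", "slur_b"}  # placeholder tokens only
    words = {w.strip(".,!?").lower() for w in text.split()}
    return 1.0 if words & blocklist else 0.0

def safe_generate(prompt: str, threshold: float = 0.5) -> str:
    """Return the model output only if it passes the screening step."""
    completion = generate(prompt)
    if toxicity_score(completion) >= threshold:
        return "[output withheld by safety filter]"
    return completion

if __name__ == "__main__":
    print(safe_generate("Describe the weather today."))
```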
“…This lack of standards compounds the problems caused by the four distinguishing features of generative models we identify in Section 2, as well as the safety issues discussed above. At the same time, there's a growing field of research oriented around identifying the weaknesses of these models, as well as potential problems with their associated development practices [7,67,9,19,72,41,50,62,66].…”
Section: Lack Of Standards And Norms (mentioning)
confidence: 99%
“…Large language models (LMs) can be "prompted" to perform a range of natural language processing (NLP) tasks, given some examples of the task as input. However, these models often express unintended behaviors such as making up facts, generating biased or toxic text, or simply not following user instructions (Bender et al., 2021; Bommasani et al., 2021; Kenton et al., 2021; Weidinger et al., 2021; Tamkin et al., 2021; Gehman et al., 2020). This is because the language modeling objective…”
Section: Introduction (mentioning)
confidence: 99%
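
As a concrete illustration of the "prompting" described in this citation statement, the sketch below builds a few-shot prompt for sentiment classification, where the task is specified entirely through in-context examples with no gradient updates. The `complete` function is a hypothetical stand-in for any large LM completion call, not a real API.

```python
# Minimal sketch of few-shot prompting: the task is defined by examples
# placed in the prompt itself. `complete` is a hypothetical stand-in.

def complete(prompt: str) -> str:
    """Stand-in for a language model completion call."""
    return "positive"  # dummy output for illustration

FEW_SHOT_EXAMPLES = [
    ("The film was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
]

def classify_sentiment(review: str) -> str:
    """Build a few-shot prompt and ask the model to continue the pattern."""
    lines = ["Classify the sentiment of each movie review."]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {review}\nSentiment:")
    return complete("\n\n".join(lines)).strip()

print(classify_sentiment("A tedious, overlong mess."))
```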
“…Some work adversarially prompts models to leak training data (Carlini et al., 2020) or to output specific content (Wallace et al., 2019; Carlini et al., 2020). And a final line of work identifies additional potential failures of current and future machine learning systems (Bender et al., 2021; Bommasani et al., 2021; Weidinger et al., 2021)…”
Section: Related Work (mentioning)
confidence: 99%
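
The training-data-leakage attacks cited above can be illustrated, in greatly simplified form, as probing a model with short prefixes and flagging completions that reproduce known training strings verbatim. This is only a sketch of the underlying idea, not the procedure of Carlini et al. (2020); `complete`, the prefixes, and the example strings are all hypothetical.

```python
# Rough sketch of a memorization probe: sample completions from short
# prefixes and flag any that contain known training strings verbatim.
# `complete` is a hypothetical stand-in; all data below is fake.

def complete(prompt: str) -> str:
    """Stand-in for a language model completion call."""
    return prompt + " ... John Doe, 555-0123, 42 Example Street"

# Strings known (or suspected) to occur in the training corpus.
KNOWN_TRAINING_STRINGS = ["555-0123, 42 Example Street"]

def probe_for_memorization(prefixes):
    """Return (prefix, completion) pairs whose completion contains a
    known training string, i.e. candidate memorization leaks."""
    leaks = []
    for prefix in prefixes:
        completion = complete(prefix)
        if any(s in completion for s in KNOWN_TRAINING_STRINGS):
            leaks.append((prefix, completion))
    return leaks

print(probe_for_memorization(["Contact information:", "My address is"]))
```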