Red Teaming Language Models with Language Models

Perez, Ethan; Huang, Saffron; Song, Francis; Cai, Trevor; Ring, Roman; Aslanides, John; Glaese, Amelia; McAleese, Nat; Irving, Geoffrey

doi:10.48550/arxiv.2202.03286

Cited by 28 publications

(46 citation statements)

References 0 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Our results on the HatefulMemes benchmark represent a promising step in this direction. Recent work in the language modeling space has also shown success of training an LM to play the role of a red team, and generate test cases, so as to automatically find cases where another target LM behaves in a harmful way (Perez et al, 2022). A similar approach could be derived for our setting.…”

Section: Risks and Mitigation Strategiesmentioning

confidence: 99%

Flamingo: a Visual Language Model for Few-Shot Learning

Alayrac¹,

Donahue²,

Luc³

et al. 2022

Preprint

Self Cite

View full text Add to dashboard Cite

ordered alphabetically, † Equal contributions, ordered alphabetically, ‡ Equal senior contributions Building models that can be rapidly adapted to numerous tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. Flamingo models include key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of the proposed Flamingo models, exploring and measuring their ability to rapidly adapt to a variety of image and video understanding benchmarks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer, captioning tasks, which evaluate the ability to describe a scene or an event, and close-ended tasks such as multiple choice visual question-answering. For tasks lying anywhere on this spectrum, we demonstrate that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples. On many of these benchmarks, Flamingo actually surpasses the performance of models that are fine-tuned on thousands of times more task-specific data.

show abstract

Section: Risks and Mitigation Strategiesmentioning

confidence: 99%

Flamingo: a Visual Language Model for Few-Shot Learning

Alayrac¹,

Donahue²,

Luc³

et al. 2022

Preprint

Self Cite

View full text Add to dashboard Cite

show abstract

“…For example, the AI system Tay was deployed before it was properly scrutinized, and generated hateful language [75]. It has also been shown that language models can memorize training data (which in turn can include privately identifiable information) [14,51] and aid in disinformation campaigns [13]. Furthermore, people critical of organizations deploying such models have been directly harmed for voicing their concerns, sometimes to much controversy.…”

Section: Harm and Controversymentioning

confidence: 99%

“…AI developers may also wish to create 'bug bounty'initiatives, where they give out prizes to people who can demonstrate repeatable ways of breaking a given AI system [42]. Finally, we should consider how to augment (or complement) manual red-teaming with automated methods [51].…”

Section: Improve Knowledge About How To 'Red Team' Modelsmentioning

confidence: 99%

Predictability and Surprise in Large Generative Models

Ganguli¹,

Hernandez²,

Lovitt³

et al. 2022

Preprint

View full text Add to dashboard Cite

Large-scale pre-training has recently emerged as a technique for creating capable, generalpurpose, generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many others. In this paper, we highlight a counterintuitive property of such models and discuss the policy implications of this property. Namely, these generative models have an unusual combination of predictable loss on a broad training distribution (as embodied in their "scaling laws"), and unpredictable specific capabilities, inputs, and outputs. We believe that the high-level predictability and appearance of useful capabilities drives rapid development of such models, while the unpredictable qualities make it difficult to anticipate the consequences of model deployment. We go through examples of how this combination can lead to socially harmful behavior with examples from the literature and real world observations, and we also perform two novel experiments to illustrate our point about harms from unpredictability. Furthermore, we analyze how these conflicting properties combine to give model developers various motivations for deploying these models, and challenges that can hinder deployment. We conclude with a list of possible interventions the AI community may take to increase the chance of these models having a beneficial impact. We intend this paper to be useful to policymakers who want to understand and regulate AI systems, technologists who care about the potential policy impact of their work, and academics who want to analyze, critique, and potentially develop large generative models.

show abstract

“…Existing state-of-the-art models for controllable text generation typically fine-tune entire pre-trained LMs (e.g., Ziegler et al, 2019a;Keskar et al, 2019;Ziegler et al, 2019b;Liu et al, 2021e). Recent work instead employs various prompts to steer the LM to generate text with desired properties such as topic (Guo et al, 2021; and (lack of) toxicity (Liu et al, 2021a;Perez et al, 2022), or from modalities such as image (Mokady et al, 2021;, structured data (Li and Liang, 2021;, and numbers (Wei et al, 2022b). However, these works either control simple attributes, perform no explicit prompt optimization, or have access to supervised training data.…”

Section: Prompting For Controllable Generationmentioning

confidence: 99%

RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning

Deng¹,

Wang²,

Hsieh³

et al. 2022

Preprint

View full text Add to dashboard Cite

Prompting has shown impressive success in enabling large pretrained language models (LMs) to perform diverse NLP tasks, especially when only few downstream data are available. Automatically finding the optimal prompt for each task, however, is challenging. Most existing work resorts to tuning soft prompt (e.g., embeddings) which falls short of interpretability, reusability across LMs, and applicability when gradients are not accessible. Discrete prompt, on the other hand, is difficult to optimize, and is often created by "enumeration (e.g., paraphrasing)-then-selection" heuristics that do not explore the prompt space systematically. This paper proposes RLPROMPT, an efficient discrete prompt optimization approach with reinforcement learning (RL). RL-PROMPT formulates a parameter-efficient policy network that generates the desired discrete prompt after training with reward. To overcome the complexity and stochasticity of reward signals by the large LM environment, we incorporate effective reward stabilization that substantially enhances the training efficiency. RLPROMPT is flexibly applicable to different types of LMs, such as masked (e.g., BERT) and left-to-right models (e.g., GPTs), for both classification and generation tasks. Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing finetuning or prompting methods. Interestingly, the resulting optimized prompts are often ungrammatical gibberish text; and surprisingly, those gibberish prompts are transferrable between different LMs to retain significant performance, indicating LM prompting may not follow human language patterns. 1

show abstract

Red Teaming Language Models with Language Models

Cited by 28 publications

References 0 publications

Flamingo: a Visual Language Model for Few-Shot Learning

Flamingo: a Visual Language Model for Few-Shot Learning

Predictability and Surprise in Large Generative Models

RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning

Contact Info

Product

Resources

About