2022
DOI: 10.48550/arxiv.2202.03286
|View full text |Cite
Preprint
|
Sign up to set email alerts
|

Red Teaming Language Models with Language Models

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
1
1
1
1

Citation Types

0
33
0

Year Published

2022
2022
2024
2024

Publication Types

Select...
5
4

Relationship

1
8

Authors

Journals

citations
Cited by 28 publications
(46 citation statements)
references
References 0 publications
0
33
0
Order By: Relevance
“…Our results on the HatefulMemes benchmark represent a promising step in this direction. Recent work in the language modeling space has also shown success of training an LM to play the role of a red team, and generate test cases, so as to automatically find cases where another target LM behaves in a harmful way (Perez et al, 2022). A similar approach could be derived for our setting.…”
Section: Risks and Mitigation Strategiesmentioning
confidence: 99%
“…Our results on the HatefulMemes benchmark represent a promising step in this direction. Recent work in the language modeling space has also shown success of training an LM to play the role of a red team, and generate test cases, so as to automatically find cases where another target LM behaves in a harmful way (Perez et al, 2022). A similar approach could be derived for our setting.…”
Section: Risks and Mitigation Strategiesmentioning
confidence: 99%
“…For example, the AI system Tay was deployed before it was properly scrutinized, and generated hateful language [75]. It has also been shown that language models can memorize training data (which in turn can include privately identifiable information) [14,51] and aid in disinformation campaigns [13]. Furthermore, people critical of organizations deploying such models have been directly harmed for voicing their concerns, sometimes to much controversy.…”
Section: Harm and Controversymentioning
confidence: 99%
“…AI developers may also wish to create 'bug bounty'initiatives, where they give out prizes to people who can demonstrate repeatable ways of breaking a given AI system [42]. Finally, we should consider how to augment (or complement) manual red-teaming with automated methods [51].…”
Section: Improve Knowledge About How To 'Red Team' Modelsmentioning
confidence: 99%
“…Existing state-of-the-art models for controllable text generation typically fine-tune entire pre-trained LMs (e.g., Ziegler et al, 2019a;Keskar et al, 2019;Ziegler et al, 2019b;Liu et al, 2021e). Recent work instead employs various prompts to steer the LM to generate text with desired properties such as topic (Guo et al, 2021; and (lack of) toxicity (Liu et al, 2021a;Perez et al, 2022), or from modalities such as image (Mokady et al, 2021;, structured data (Li and Liang, 2021;, and numbers (Wei et al, 2022b). However, these works either control simple attributes, perform no explicit prompt optimization, or have access to supervised training data.…”
Section: Prompting For Controllable Generationmentioning
confidence: 99%