2021
DOI: 10.48550/arxiv.2103.14659
Preprint

Alignment of Language Agents

Zachary Kenton, Tom Everitt, Laura Weidinger, et al.

Abstract: For artificial intelligence to be beneficial to humans, the behaviour of AI agents needs to be aligned with what humans want. In this paper we discuss some behavioural issues for language agents, arising from accidental misspecification by the system designer. We highlight some ways that misspecification can occur, discuss the behavioural issues that could arise from it, including deceptive or manipulative language, and review some approaches for avoiding these issues.

Cited by 22 publications (38 citation statements)
References 42 publications (65 reference statements)
“…Our methodology could be used to encode such different notions, but any single safety objective and fine-tuning dataset will not be able to simultaneously accommodate divergent cultural norms. Developing richer definitions and taxonomies of dialog agent behaviors, such as how polite behavior should be operationalized, is important for avoiding misspecification [104] and testing whether model behavior aligns with politeness norms in defined application contexts.…”
Section: Safety As a Concept And A Metric
confidence: 99%
“…The prompt conditions the model's prior over responses but does not result in a consistently reliable or factual dialogue model. We refer the reader to Weidinger et al. (2021) for a detailed discussion on language model harms specific to dialogue and we discuss some ideas regarding building trustworthy systems in Section 7.3.…”
Section: Prompt Generation
confidence: 99%
“…LLMs are trained infrequently due to their expense, so mistakes are slow to correct during pre-training but fast to correct if mitigations are applied downstream. Fast iteration is critical when factual information changes (Lazaridou et al., 2021), societal values change (Weidinger et al., 2021), or our knowledge about how to mitigate harms changes. In particular, accidental censoring of data can damage performance for language by or about marginalized groups (Dodge et al., 2021; Welbl et al., 2021).…”
Section: Safety Benefits and Safety Risks
confidence: 99%
“…This lack of standards compounds the problems caused by the four distinguishing features of generative models we identify in Section 2, as well as the safety issues discussed above. At the same time, there's a growing field of research oriented around identifying the weaknesses of these models, as well as potential problems with their associated development practices [7,67,9,19,72,41,50,62,66].…”
Section: Lack Of Standards and Norms
confidence: 99%
“…Although we focus on scaling laws, many of our points complement existing views on the societal risks of deploying large models [7,67,9,19,72,41]. However, similarly to [72], we do not consider here the costs of human labor involved in creating and annotating training data [28], the ethics of supply chains involved in creating the requisite hardware on which to train models [18], or the environmental costs of training models [7,50,62,66].…”
Section: Introduction
confidence: 99%