2019
DOI: 10.48550/arxiv.1906.01820
Preprint

Risks from Learned Optimization in Advanced Machine Learning Systems

Abstract: We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer, a situation we refer to as mesa-optimization, a neologism we introduce in this paper. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, …
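To make the abstract's base-objective/mesa-objective distinction concrete, the following is a minimal, purely illustrative Python sketch (not from the paper; the gridworld, GOAL, GREEN_CELLS, and all function names are assumptions): the base objective rewards reaching a specific goal cell, while the learned policy internally searches over a proxy mesa-objective that coincides with the goal during training but can diverge from it off-distribution.

GOAL = (4, 4)                    # cell the base objective rewards
GREEN_CELLS = {(4, 4), (0, 4)}   # cells the proxy mesa-objective rewards

def base_objective(trajectory):
    # Reward the base optimizer (e.g. SGD) actually trains the model against.
    return 1.0 if trajectory[-1] == GOAL else 0.0

def mesa_objective(trajectory):
    # Proxy objective the learned model happens to search over internally.
    return 1.0 if trajectory[-1] in GREEN_CELLS else 0.0

def mesa_optimizer_policy(state, candidate_plans):
    # The learned model is itself an optimizer: it scores candidate plans
    # against its internal mesa-objective and returns the best one.
    return max(candidate_plans, key=lambda plan: mesa_objective([state] + plan))

# Off the training distribution, a green cell that is not the goal appears.
plans = [[(1, 1), (2, 2)], [(2, 2), (0, 4)]]
chosen = mesa_optimizer_policy((0, 0), plans)
print(mesa_objective([(0, 0)] + chosen))   # 1.0: mesa-objective satisfied
print(base_objective([(0, 0)] + chosen))   # 0.0: base objective not satisfied

In this toy setting the policy picks the plan ending at (0, 4): the mesa-objective scores it 1.0 while the base objective scores it 0.0, which is the kind of divergence the inner alignment question in the abstract is concerned with.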

Cited by 21 publications (34 citation statements)
References 5 publications (11 reference statements)

“…Of particular concern is when an agent is optimizing for the wrong thing when out of distribution. Hubinger et al. (2019) introduce the concept of a mesa-optimizer: a learnt model which is itself an optimizer for some mesa-objective, which may differ from the base-objective used to train the model, when deployed outside of the training environment. This leads to the so-called inner alignment problem:…”
Section: Inner Alignment (mentioning, confidence: 99%)

Alignment of Language Agents
Kenton, Everitt, Weidinger et al., 2021. Preprint.
“…Of particular concern is deceptive alignment (Hubinger et al., 2019), where the mesa-optimizer acts as if it's optimizing the base objective as an instrumental goal, whereas its actual mesa-objective is different.…”
Section: Inner Alignment (mentioning, confidence: 99%)

“…Understanding such levels gives insight into how open-ended search can diverge from the system designer's intents, creating potential safety hazards. Note that the categorization presented next is adapted from previous AI safety categorizations (Ortega and Maini, 2018; Hubinger et al., 2019).…”
Section: How Safety Issues Emerge In Open-ended Search (mentioning, confidence: 99%)
“…From this point of view, nearly everything of moral worth results from humanity transcending the explicit incentives of the search algorithm. In contrast, a central focus within top-down AI safety is to explicitly align an AI's incentives with our own (Hubinger et al., 2019; Taylor et al., 2016a), e.g. by modeling human preferences to use as an objective function (Leike et al., 2018), or to be cautious of divergences between explicit incentives and agent incentives (Hubinger et al., 2019).…”
Section: Case Study: Biological Evolution and AI Safety (mentioning, confidence: 99%)