2019
DOI: 10.48550/arxiv.1906.01820
Preprint

Risks from Learned Optimization in Advanced Machine Learning Systems

Abstract: We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer, a situation we refer to as mesa-optimization, a neologism we introduce in this paper. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, …
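To make the abstract's base-objective/mesa-objective distinction concrete, the following is a minimal, purely illustrative Python sketch (not from the paper; the gridworld, GOAL, GREEN_CELLS, and all function names are assumptions): the base objective rewards reaching a specific goal cell, while the learned policy internally searches over a proxy mesa-objective that coincides with the goal during training but can diverge from it off-distribution.

GOAL = (4, 4)                    # cell the base objective rewards
GREEN_CELLS = {(4, 4), (0, 4)}   # cells the proxy mesa-objective rewards

def base_objective(trajectory):
    # Reward the base optimizer (e.g. SGD) actually trains the model against.
    return 1.0 if trajectory[-1] == GOAL else 0.0

def mesa_objective(trajectory):
    # Proxy objective the learned model happens to search over internally.
    return 1.0 if trajectory[-1] in GREEN_CELLS else 0.0

def mesa_optimizer_policy(state, candidate_plans):
    # The learned model is itself an optimizer: it scores candidate plans
    # against its internal mesa-objective and returns the best one.
    return max(candidate_plans, key=lambda plan: mesa_objective([state] + plan))

# Off the training distribution, a green cell that is not the goal appears.
plans = [[(1, 1), (2, 2)], [(2, 2), (0, 4)]]
chosen = mesa_optimizer_policy((0, 0), plans)
print(mesa_objective([(0, 0)] + chosen))   # 1.0: mesa-objective satisfied
print(base_objective([(0, 0)] + chosen))   # 0.0: base objective not satisfied

In this toy setting the policy picks the plan ending at (0, 4): the mesa-objective scores it 1.0 while the base objective scores it 0.0, which is the kind of divergence the inner alignment question in the abstract is concerned with.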

Cited by 21 publications (34 citation statements)
References 5 publications (11 reference statements)

“…Of particular concern is when an agent is optimizing for the wrong thing when out of distribution. Hubinger et al. (2019) introduce the concept of a mesa-optimizer: a learnt model which is itself an optimizer for some mesa-objective, which may differ from the base-objective used to train the model, when deployed outside of the training environment. This leads to the so-called inner alignment problem:…”
Section: Inner Alignment (mentioning, confidence: 99%)

Alignment of Language Agents
Kenton, Everitt, Weidinger et al., 2021. Preprint.
“…Of particular concern is deceptive alignment (Hubinger et al., 2019), where the mesa-optimizer acts as if it's optimizing the base objective as an instrumental goal, whereas its actual mesa-objective is different.…”
Section: Inner Alignment (mentioning, confidence: 99%)

“…Understanding such levels gives insight into how open-ended search can diverge from the system designer's intents, creating potential safety hazards. Note that the categorization presented next is adapted from previous AI safety categorizations (Ortega and Maini, 2018; Hubinger et al., 2019).…”
Section: How Safety Issues Emerge In Open-ended Search (mentioning, confidence: 99%)
“…From this point of view, nearly everything of moral worth results from humanity transcending the explicit incentives of the search algorithm. In contrast, a central focus within top-down AI safety is to explicitly align an AI's incentives with our own (Hubinger et al., 2019; Taylor et al., 2016a), e.g. by modeling human preferences to use as an objective function (Leike et al., 2018), or to be cautious of divergences between explicit incentives and agent incentives (Hubinger et al., 2019).…”
Section: Case Study: Biological Evolution and AI Safety (mentioning, confidence: 99%)