2022
DOI: 10.1007/s10458-022-09575-5

Scalar reward is not enough: a response to Silver, Singh, Precup and Sutton (2021)

Abstract: The recent paper “Reward is Enough” by Silver, Singh, Precup and Sutton posits that the concept of reward maximisation is sufficient to underpin all intelligence, both natural and artificial, and provides a suitable basis for the creation of artificial general intelligence. We contest the underlying assumption of Silver et al. that such reward can be scalar-valued. In this paper we explain why scalar rewards are insufficient to account for some aspects of both biological and computational intelligence, and arg…
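
To make the scalar-versus-vector distinction concrete, here is a minimal illustrative sketch (not code from the paper; the environment, reward components and weight values are all hypothetical). It shows a vector-valued reward with one component per objective, and a linear scalarisation whose weights are an extra modelling commitment that a purely scalar formulation leaves implicit.

```python
import numpy as np

def vector_reward(state, action):
    """Hypothetical environment feedback: one reward component per objective,
    e.g. (task progress, energy cost, safety margin)."""
    return np.array([1.0, -0.2, 0.5])

def linear_scalarise(reward_vec, weights):
    """Collapse the vector reward to a scalar; picking the weights is a
    modelling decision hidden by a purely scalar reward signal."""
    return float(np.dot(weights, reward_vec))

r = vector_reward(state=None, action=None)
print(linear_scalarise(r, weights=np.array([0.6, 0.2, 0.2])))  # 0.66
```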

Cited by 18 publications (12 citation statements)
References: 74 publications
“…Thirdly, in the context of modeling human values, this approach might sometimes be more consistent with human value processing [28]. At almost any level of analysis possible, human intelligence is multi-objective [32]. Biological life uses a set of multi-objective homeostatic systems to prioritize acquiring resources that are needed most given the organism's state [24].…”
Section: Design Principles Research Context (mentioning)
confidence: 99%
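
As a loose illustration of the homeostatic prioritisation idea quoted above (an assumption on our part, not a model taken from the cited works), the sketch below simply attends to whichever internal variable is furthest below its set-point; the variable names and numbers are invented.

```python
# Toy homeostatic prioritisation: pick the most depleted internal variable.
set_points = {"energy": 1.0, "hydration": 1.0, "temperature": 0.5}
internal_state = {"energy": 0.4, "hydration": 0.9, "temperature": 0.45}

def most_urgent_need(internal_state, set_points):
    """Return the homeostatic variable with the largest deficit."""
    deficits = {k: set_points[k] - internal_state[k] for k in set_points}
    return max(deficits, key=deficits.get)

print(most_urgent_need(internal_state, set_points))  # -> "energy"
```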
“…We believe it is reasonable to use MARL as a first step in exploring the use of AI tools to study multi-person social dilemmas. The current model for reinforcement learning suggests that reward maximization is sufficient to drive behavior that exhibits abilities studied in the human cooperation and social dilemmas, including "knowledge, learning, perception, social intelligence, language, generalization and imitation" (Yang, 2021;Silver et al, 2021;Vamplew et al, 2022). The justification for this claim is deeply rooted in the von Neumann Morgenstern utility theory (von Neumann and Morgenstern, 2007), which is the basis for the well-known expected utility theory (Schoemaker, 2013) and essentially states that it is safe to assume an intelligent entity will always make decisions according to the highest expected utility in any complex scenarios 1 (Yang, 2021).…”
Section: Introduction (mentioning)
confidence: 99%
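
For readers unfamiliar with the expected-utility claim referenced in this excerpt, here is a tiny worked example (illustrative only; the lotteries and utility values are made up): the agent scores each lottery by its probability-weighted utility and chooses the maximum.

```python
# Expected-utility choice between two hypothetical lotteries.
lotteries = {
    "A": [(0.8, 10.0), (0.2, 0.0)],   # (probability, utility) pairs
    "B": [(0.5, 20.0), (0.5, -5.0)],
}

def expected_utility(lottery):
    return sum(p * u for p, u in lottery)

scores = {name: expected_utility(l) for name, l in lotteries.items()}
best = max(scores, key=scores.get)
print(scores, best)  # {'A': 8.0, 'B': 7.5} 'A'
```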
“…Reinforcement learning (RL) is a mechanism for an agent to maximize expected reward by calibrating behavior to match behaviors that have been reinforced with reward (or punishment) in the past ( Sutton et al, 1992 ). RL has directly measurable signals in neural circuitry ( Schultz et al, 1997 ), has been foundational for the development of our understanding of human learning in general ( Shteingart & Loewenstein, 2014 ), and not only underpins human learning but also seems fundamental for the development of human-level artificial general intelligence ( Ide et al, 2022 ; Silver et al, 2021 ; Vamplew et al, 2022 ). RL is also important in the development of appropriate response inhibition, which plays a key role in goal-directed behavior ( Berkman, 2018 ; Verbruggen & Logan, 2008 ), psychopathological conditions ( Howlett et al, 2023 ), and in inhibitory response training for reducing unhealthy food intake ( Houben, 2011 ; Lawrence et al, 2015 ).…”
Section: Introduction (mentioning)
confidence: 99%
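
As a generic illustration of the reward-maximisation mechanism this excerpt describes (a textbook tabular Q-learning sketch, assumed for illustration rather than taken from any of the cited works), behaviour is calibrated by nudging action values towards rewards observed in the past and acting greedily on those values.

```python
import random

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One-step tabular Q-learning: move Q(s, a) towards the observed
    reward plus the discounted value of the best next action."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Mostly exploit the highest-valued action, occasionally explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```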