2022
DOI: 10.1371/journal.pcbi.1010350

Asymmetric and adaptive reward coding via normalized reinforcement learning

Abstract: Learning is widely modeled in psychology, neuroscience, and computer science by prediction error-guided reinforcement learning (RL) algorithms. While standard RL assumes linear reward functions, reward-related neural activity is a saturating, nonlinear function of reward; however, the computational and behavioral implications of nonlinear RL are unknown. Here, we show that nonlinear RL incorporating the canonical divisive normalization computation introduces an intrinsic and tunable asymmetry in prediction err…
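As a rough illustration of the abstract's claim, the sketch below passes rewards through a saturating divisive normalization before a standard delta-rule update, which makes equal-sized reward increases and decreases produce unequal prediction errors. The specific normalization r/(sigma + r) and the parameter values are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def normalized_rl(rewards, alpha=0.1, sigma=2.0):
    """Delta-rule learning on divisively normalized rewards (illustrative sketch).

    Each reward r is passed through a saturating nonlinearity r / (sigma + r)
    before the prediction error is computed, so equal-sized increases and
    decreases in reward no longer produce equal-sized prediction errors
    (an intrinsic asymmetry tuned by sigma).
    """
    v = 0.0
    values, rpes = [], []
    for r in rewards:
        r_norm = r / (sigma + r)   # saturating, normalized reward
        delta = r_norm - v         # prediction error on the normalized scale
        v += alpha * delta         # standard delta-rule update
        values.append(v)
        rpes.append(delta)
    return np.array(values), np.array(rpes)

# Example: after learning a baseline reward of 4, a step up to 6 and a step
# down to 2 (equal in absolute size) give prediction errors of unequal size.
rewards = np.full(500, 4.0)
v, _ = normalized_rl(rewards)
v_conv = v[-1]
rpe_up = 6.0 / (2.0 + 6.0) - v_conv
rpe_down = 2.0 / (2.0 + 2.0) - v_conv
print(f"converged value {v_conv:.3f}, RPE(+2) {rpe_up:+.3f}, RPE(-2) {rpe_down:+.3f}")
```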

Cited by 12 publications (19 citation statements)
References 52 publications
“…Define $V(1) \propto \mathbb{1}_{x(1)}\, x$, and then, for t > 1,
$$V(t) \propto \theta^{\,t-1} V(1) + \sum_{k=2}^{t} \theta^{\,t-k}\, \mathbb{1}_{x(k)} \times \left( \frac{x}{\omega + x + \sum_{u=1}^{k-1} \theta^{\,k-u}\, \mathbb{1}_{x(u)}\, x} \right)$$
where ω is a positive constant reflecting saturation in neuronal responses. In this expression for the Pavlovian learned value V(t), which was inspired by (36), reward value is normalized by the recent reward history $\sum_{u=1}^{k-1} \theta^{\,k-u}\, \mathbb{1}_{x(u)}\, x$, to account for the well-evidenced phenomenon of “divisive normalization” (37). The basic idea is that the perception of a stimulus depends not on its absolute intensity but rather on its intensity relative to what the agent expects (which is assumed to depend on the recent outcome history).…”
Section: Results (mentioning; confidence: 99%)
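Read literally, the recursion quoted above can be computed directly. The sketch below assumes a fixed reward magnitude x, a binary win indicator per trial, and the exponent conventions used in the reconstruction; these are assumptions about the citing paper's model, not a verified implementation.

```python
def pavlovian_value(wins, x=1.0, theta=0.9, omega=0.5):
    """Recency-weighted, divisively normalized Pavlovian value V(t).

    wins  : sequence of 0/1 indicators, wins[k-1] = 1 if trial k paid out
    x     : fixed reward magnitude
    theta : recency (discount) weight
    omega : positive saturation constant
    V(1) is proportional to 1_{x(1)} * x; each later win contributes its
    reward normalized by omega + x + the theta-discounted history of
    earlier wins, as in the quoted expression.
    """
    t = len(wins)
    v = theta ** (t - 1) * wins[0] * x
    for k in range(2, t + 1):  # trials 2..t
        history = sum(theta ** (k - u) * wins[u - 1] * x for u in range(1, k))
        v += theta ** (t - k) * wins[k - 1] * (x / (omega + x + history))
    return v

# Example: the same win on the last trial contributes less after a rich
# recent reward history than after a lean one (divisive normalization).
print(pavlovian_value([1, 1, 1, 1, 1]), pavlovian_value([0, 0, 0, 0, 1]))
```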
“…where ω is a positive constant reflecting saturation in neuronal responses. In this expression for the Pavlovian learned value V(t), which was inspired by (36), reward value is normalized by the recent reward history…”
Section: Pavlovian Value and Craving Power of a Gambling Cue (mentioning; confidence: 99%)
“…Specifically, we extended the learning rule from DRL [15] to account not only for the relationship between asymmetry and threshold but also for the three other tuning properties predicted by the efficient code. Besides DRL, there are two alternative explanations of the data by Eshel et al. [16]: the Laplace code [30] and normalized reinforcement learning [31]. In the Laplace code, TD-learning neurons with different parameters are used to encode the timing and a whole distribution of rewards in the future by representing an analogue of the Laplace transform in the neural responses.…”
Section: Discussion (mentioning; confidence: 99%)
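As background for the "learning rule from DRL [15]" and the asymmetry/threshold relationship mentioned above, here is a minimal sketch of the standard expectile-style distributional RL update, in which each channel scales positive and negative prediction errors differently. It is the generic rule, not the citing paper's extended version; the reward distribution and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def distributional_update(values, reward, taus, alpha=0.05):
    """One expectile-style distributional RL step.

    Each channel i scales positive prediction errors by taus[i] and negative
    ones by 1 - taus[i], so its value settles at a different point of the
    reward distribution; the resulting spread of thresholds is the
    asymmetry/threshold link discussed in the quoted passage.
    """
    delta = reward - values
    scale = np.where(delta > 0, taus, 1.0 - taus)
    return values + alpha * scale * delta

taus = np.linspace(0.1, 0.9, 5)      # five channels with different asymmetries
values = np.zeros_like(taus)
for _ in range(20000):
    r = rng.choice([1.0, 10.0])      # bimodal reward distribution
    values = distributional_update(values, r, taus)
print(np.round(values, 2))           # spread of learned reward thresholds
```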
“…In their formulation, they subtract the expected response from each neuron instead of changing the gain (as we predict and observe). Normalized reinforcement learning [31] proposes that RPENs perform divisive normalization with different half saturation constants. This yields sigmoid (Naka-Rushton [32]) neurons with different thresholds, and the asymmetry around those thresholds is explained again by cutting the sigmoids at different heights.…”
Section: Main (mentioning; confidence: 99%)
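The passage above describes reward-prediction-error neurons (RPENs) as sigmoid (Naka-Rushton) units with different half-saturation constants. The sketch below simply evaluates that response family to show how the half-saturation constant sets each unit's effective reward threshold; the exponent and maximum response are illustrative assumptions.

```python
import numpy as np

def naka_rushton(r, sigma, n=2.0, r_max=1.0):
    """Naka-Rushton (divisively normalized) response: r_max * r^n / (r^n + sigma^n)."""
    return r_max * r ** n / (r ** n + sigma ** n)

rewards = np.linspace(0.0, 10.0, 6)
for sigma in (0.5, 2.0, 8.0):        # different half-saturation constants
    resp = naka_rushton(rewards, sigma)
    # the reward at which the response crosses half of r_max acts as the
    # unit's threshold; larger sigma -> higher threshold, and responses above
    # vs below that point are compressed to different degrees
    print(f"sigma={sigma}: {np.round(resp, 2)}")
```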
“…We used an efficient population coding framework [22][23][24]. In order to optimize this efficient population code online, we generalize the distributional learning rules (Figure 5A) to the time domain, considering multiple channels with different relative scaling for over- and underestimation of reward times, which generate a diversity of learnt reward time scales (Figure 5E). Importantly, these parameters converge to the efficient code that optimally adapts to the statistics of expected reward times in the environment (see Methods).…”
Section: Value and Temporal Sensitivity Efficiently Adapt To Environm... (mentioning; confidence: 99%)
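One way to picture the "multiple channels with different relative scaling for over- and underestimation of reward times" quoted above is to reuse the same asymmetric update on observed reward delays; the exponential delay distribution and parameter values below are illustrative assumptions, not the cited model.

```python
import numpy as np

rng = np.random.default_rng(2)

taus = np.linspace(0.2, 0.8, 4)            # per-channel asymmetry for timing errors
time_estimates = np.full_like(taus, 5.0)   # initial guess of reward delay (a.u.)
alpha = 0.02
for _ in range(50000):
    delay = rng.exponential(5.0)           # observed reward delay on this trial
    delta = delay - time_estimates
    scale = np.where(delta > 0, taus, 1.0 - taus)  # over- vs underestimation scaling
    time_estimates += alpha * scale * delta
print(np.round(time_estimates, 2))         # a diversity of learnt reward time scales
```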