2022
DOI: 10.1371/journal.pcbi.1010350

Asymmetric and adaptive reward coding via normalized reinforcement learning

Abstract: Learning is widely modeled in psychology, neuroscience, and computer science by prediction error-guided reinforcement learning (RL) algorithms. While standard RL assumes linear reward functions, reward-related neural activity is a saturating, nonlinear function of reward; however, the computational and behavioral implications of nonlinear RL are unknown. Here, we show that nonlinear RL incorporating the canonical divisive normalization computation introduces an intrinsic and tunable asymmetry in prediction err…
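As a rough illustration of the abstract's claim, the sketch below passes rewards through a saturating divisive normalization before a standard delta-rule update, which makes equal-sized reward increases and decreases produce unequal prediction errors. The specific normalization r/(sigma + r) and the parameter values are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def normalized_rl(rewards, alpha=0.1, sigma=2.0):
    """Delta-rule learning on divisively normalized rewards (illustrative sketch).

    Each reward r is passed through a saturating nonlinearity r / (sigma + r)
    before the prediction error is computed, so equal-sized increases and
    decreases in reward no longer produce equal-sized prediction errors
    (an intrinsic asymmetry tuned by sigma).
    """
    v = 0.0
    values, rpes = [], []
    for r in rewards:
        r_norm = r / (sigma + r)   # saturating, normalized reward
        delta = r_norm - v         # prediction error on the normalized scale
        v += alpha * delta         # standard delta-rule update
        values.append(v)
        rpes.append(delta)
    return np.array(values), np.array(rpes)

# Example: after learning a baseline reward of 4, a step up to 6 and a step
# down to 2 (equal in absolute size) give prediction errors of unequal size.
rewards = np.full(500, 4.0)
v, _ = normalized_rl(rewards)
v_conv = v[-1]
rpe_up = 6.0 / (2.0 + 6.0) - v_conv
rpe_down = 2.0 / (2.0 + 2.0) - v_conv
print(f"converged value {v_conv:.3f}, RPE(+2) {rpe_up:+.3f}, RPE(-2) {rpe_down:+.3f}")
```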

Cited by 12 publications (19 citation statements)
References 52 publications
“…Define $V(1) \propto \mathbb{1}_{x(1)}\, x$, and then, for t > 1,
$$V(t) \propto \theta^{\,t-1} V(1) + \sum_{k=2}^{t} \theta^{\,t-k}\, \mathbb{1}_{x(k)} \times \left( \frac{x}{\omega + x + \sum_{u=1}^{k-1} \theta^{\,k-u}\, \mathbb{1}_{x(u)}\, x} \right)$$
where ω is a positive constant reflecting saturation in neuronal responses. In this expression for the Pavlovian learned value V(t), which was inspired by (36), reward value is normalized by the recent reward history $\sum_{u=1}^{k-1} \theta^{\,k-u}\, \mathbb{1}_{x(u)}\, x$, to account for the well-evidenced phenomenon of “divisive normalization” (37). The basic idea is that the perception of a stimulus depends not on its absolute intensity but rather on its intensity relative to what the agent expects (which is assumed to depend on the recent outcome history).…”
Section: Results (mentioning; confidence: 99%)
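Read literally, the recursion quoted above can be computed directly. The sketch below assumes a fixed reward magnitude x, a binary win indicator per trial, and the exponent conventions used in the reconstruction; these are assumptions about the citing paper's model, not a verified implementation.

```python
def pavlovian_value(wins, x=1.0, theta=0.9, omega=0.5):
    """Recency-weighted, divisively normalized Pavlovian value V(t).

    wins  : sequence of 0/1 indicators, wins[k-1] = 1 if trial k paid out
    x     : fixed reward magnitude
    theta : recency (discount) weight
    omega : positive saturation constant
    V(1) is proportional to 1_{x(1)} * x; each later win contributes its
    reward normalized by omega + x + the theta-discounted history of
    earlier wins, as in the quoted expression.
    """
    t = len(wins)
    v = theta ** (t - 1) * wins[0] * x
    for k in range(2, t + 1):  # trials 2..t
        history = sum(theta ** (k - u) * wins[u - 1] * x for u in range(1, k))
        v += theta ** (t - k) * wins[k - 1] * (x / (omega + x + history))
    return v

# Example: the same win on the last trial contributes less after a rich
# recent reward history than after a lean one (divisive normalization).
print(pavlovian_value([1, 1, 1, 1, 1]), pavlovian_value([0, 0, 0, 0, 1]))
```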
“…where ω is a positive constant reflecting saturation in neuronal responses. In this expression for the Pavlovian learned value V(t), which was inspired by (36), reward value is normalized by the recent reward history…”
Section: Pavlovian Value and Craving Power of a Gambling Cue (mentioning; confidence: 99%)
“…Specifically, we extended the learning rule from DRL [15] to account not only for the relationship between asymmetry and threshold but also for the three other tuning properties predicted by the efficient code. Besides DRL, there are two alternative explanations of the data by Eshel et al. [16]: the Laplace code [30] and normalized reinforcement learning [31]. In the Laplace code, TD-learning neurons with different parameters are used to encode the timing and a whole distribution of rewards in the future by representing an analogue of the Laplace transform in the neural responses.…”
Section: Discussion (mentioning; confidence: 99%)
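As background for the "learning rule from DRL [15]" and the asymmetry/threshold relationship mentioned above, here is a minimal sketch of the standard expectile-style distributional RL update, in which each channel scales positive and negative prediction errors differently. It is the generic rule, not the citing paper's extended version; the reward distribution and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def distributional_update(values, reward, taus, alpha=0.05):
    """One expectile-style distributional RL step.

    Each channel i scales positive prediction errors by taus[i] and negative
    ones by 1 - taus[i], so its value settles at a different point of the
    reward distribution; the resulting spread of thresholds is the
    asymmetry/threshold link discussed in the quoted passage.
    """
    delta = reward - values
    scale = np.where(delta > 0, taus, 1.0 - taus)
    return values + alpha * scale * delta

taus = np.linspace(0.1, 0.9, 5)      # five channels with different asymmetries
values = np.zeros_like(taus)
for _ in range(20000):
    r = rng.choice([1.0, 10.0])      # bimodal reward distribution
    values = distributional_update(values, r, taus)
print(np.round(values, 2))           # spread of learned reward thresholds
```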
“…In their formulation, they subtract the expected response from each neuron instead of changing the gain (as we predict and observe). Normalized reinforcement learning [31] proposes that RPENs perform divisive normalization with different half saturation constants. This yields sigmoid (Naka-Rushton [32]) neurons with different thresholds, and the asymmetry around those thresholds is explained again by cutting the sigmoids at different heights.…”
Section: Main (mentioning; confidence: 99%)
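The passage above describes reward-prediction-error neurons (RPENs) as sigmoid (Naka-Rushton) units with different half-saturation constants. The sketch below simply evaluates that response family to show how the half-saturation constant sets each unit's effective reward threshold; the exponent and maximum response are illustrative assumptions.

```python
import numpy as np

def naka_rushton(r, sigma, n=2.0, r_max=1.0):
    """Naka-Rushton (divisively normalized) response: r_max * r^n / (r^n + sigma^n)."""
    return r_max * r ** n / (r ** n + sigma ** n)

rewards = np.linspace(0.0, 10.0, 6)
for sigma in (0.5, 2.0, 8.0):        # different half-saturation constants
    resp = naka_rushton(rewards, sigma)
    # the reward at which the response crosses half of r_max acts as the
    # unit's threshold; larger sigma -> higher threshold, and responses above
    # vs below that point are compressed to different degrees
    print(f"sigma={sigma}: {np.round(resp, 2)}")
```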
“…We used an efficient population coding framework [22][23][24]. In order to optimize this efficient population code online, we generalize the distributional learning rules (Figure 5A) to the time domain, considering multiple channels with different relative scaling for over- and underestimation of reward times, which generate a diversity of learnt reward time scales (Figure 5E). Importantly, these parameters converge to the efficient code that optimally adapts to the statistics of expected reward times in the environment (see Methods).…”
Section: Value and Temporal Sensitivity Efficiently Adapt To Environm... (mentioning; confidence: 99%)
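One way to picture the "multiple channels with different relative scaling for over- and underestimation of reward times" quoted above is to reuse the same asymmetric update on observed reward delays; the exponential delay distribution and parameter values below are illustrative assumptions, not the cited model.

```python
import numpy as np

rng = np.random.default_rng(2)

taus = np.linspace(0.2, 0.8, 4)            # per-channel asymmetry for timing errors
time_estimates = np.full_like(taus, 5.0)   # initial guess of reward delay (a.u.)
alpha = 0.02
for _ in range(50000):
    delay = rng.exponential(5.0)           # observed reward delay on this trial
    delta = delay - time_estimates
    scale = np.where(delta > 0, taus, 1.0 - taus)  # over- vs underestimation scaling
    time_estimates += alpha * scale * delta
print(np.round(time_estimates, 2))         # a diversity of learnt reward time scales
```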