2021
DOI: 10.1016/j.cobeha.2021.04.020
Value-free reinforcement learning: policy optimization as a minimal model of operant behavior

Abstract: Reinforcement learning is a powerful framework for modelling the cognitive and neural substrates of learning and decision making. Contemporary research in cognitive neuroscience and neuroeconomics typically uses value-based reinforcement-learning models, which assume that decision-makers choose by comparing learned values for different actions. However, another possibility is suggested by a simpler family of models, called policy-gradient reinforcement learning. Policy-gradient models learn by optimizing a beha…
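The contrast the abstract draws, choosing by comparing learned action values versus adjusting a behavioral policy directly, can be illustrated with a toy two-armed bandit. The sketch below is not the paper's implementation: the reward probabilities, learning rates, and variable names are assumptions made for illustration. The first learner updates action values Q with a prediction error and chooses by passing them through a softmax; the second (a REINFORCE-style policy-gradient learner) nudges action preferences h along the gradient of expected reward and never stores values.

```python
# Minimal sketch of value-based vs policy-gradient learning on a two-armed
# bandit. Illustrative assumptions throughout: reward probabilities,
# learning rate alpha, inverse temperature beta, and trial count.
import numpy as np

rng = np.random.default_rng(0)
p_reward = np.array([0.8, 0.2])          # assumed reward probability per action
alpha, beta, n_trials = 0.1, 3.0, 1000

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Value-based learner: delta-rule update of action values Q,
# choice by softmax comparison of the learned values.
Q = np.zeros(2)
for _ in range(n_trials):
    a = rng.choice(2, p=softmax(beta * Q))
    r = float(rng.random() < p_reward[a])
    Q[a] += alpha * (r - Q[a])           # reward-prediction-error update

# Policy-gradient learner (REINFORCE): action preferences h are moved
# directly along the gradient of expected reward; no values are stored.
h, baseline = np.zeros(2), 0.0
for _ in range(n_trials):
    pi = softmax(h)
    a = rng.choice(2, p=pi)
    r = float(rng.random() < p_reward[a])
    grad = -pi                           # d log pi(a) / dh_j = 1{j=a} - pi_j
    grad[a] += 1.0
    h += alpha * (r - baseline) * grad
    baseline += alpha * (r - baseline)   # running-average reward as baseline

print("learned values Q:", Q)
print("learned preferences h:", h, "-> policy:", softmax(h))
```

Both learners come to prefer the better arm, but only the first carries an explicit estimate of each action's value; the preferences h are meaningful only relative to one another.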

Cited by 29 publications (30 citation statements) | References 63 publications
“…[ 25 ]) rather than via action value representations as we have modeled here. In policy learning, generalization between odor trial types would be limited as alternative actions are grouped together, separating forced-choice from free-choice trials through the presence of the unrewarded choice option in these trial-types [ 26 ].…”
Section: Discussion (mentioning)
confidence: 99%
“…Although the actor-critic model is only one of a wider class of RL models, the findings of the present study and the framework of the analysis may be applicable to other models. For example, the results may have implications for algorithms that perform value estimations and policy updates in different systems, such as policy-gradient approaches, which have attracted attention (Mongillo et al 2014; Bennett et al 2021). When considering the online learning of continuous actions, such as in determining response vigor, a policy-gradient method based on the REINFORCE algorithm (Williams 1992) is often used (Niv 2007; Lindström et al 2021).…”
Section: Discussion (mentioning)
confidence: 99%
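The excerpt above points to REINFORCE-based policy-gradient methods for learning continuous actions such as response vigor. A hedged sketch of that idea, assuming a Gaussian policy over a single scalar action and a made-up reward function peaked at an arbitrary target, is shown below; it is meant only to show the form of the update, not the models used in the cited studies.

```python
# REINFORCE-style policy-gradient update for one continuous action
# (e.g., a response-vigor parameter). All quantities here are illustrative
# assumptions: the Gaussian policy, the quadratic reward, and the step sizes.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.0, 1.0          # policy: action ~ Normal(mu, sigma)
alpha, baseline = 0.01, 0.0
target = 2.0                  # assumed reward-maximizing action

for _ in range(20000):
    a = rng.normal(mu, sigma)
    r = -(a - target) ** 2                  # assumed reward, peaked at target
    grad_mu = (a - mu) / sigma ** 2         # d log N(a; mu, sigma) / d mu
    mu += alpha * (r - baseline) * grad_mu  # REINFORCE update on the mean
    baseline += 0.05 * (r - baseline)       # running-average baseline

print("learned mean action:", mu)           # drifts toward roughly 2.0
```

Subtracting the running-average baseline from the reward does not bias the gradient estimate but reduces its variance, which is why actor-critic variants pair a learned critic with this kind of policy update.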
“…In such models, the action values are directly translated into weights for corresponding actions such that the higher the action value, the more likely the action is to be chosen. However, it has been noted that many psychological and neuroscientific findings are concisely explained by policy-based RL models in which the preference for each action is represented independently of the reward expectations (for a review, see Mongillo et al 2014; Bennett et al 2021).…”
Section: Introduction (mentioning)
confidence: 99%
“…In assessing the relative reliability of each dimension in generating the target response, it is useful to interpret model output, and thus error, in terms of response probabilities (Bennett et al, 2021). Recapitulating eqs 6 & 7, model output for each dimension D in isolation is passed through a softmax function…”
Section: Learned Attention For Cognitive Control (mentioning)
confidence: 99%
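The softmax step this last excerpt refers to is a generic normalization, sketched below with made-up numbers; the dimension outputs and the inverse-temperature parameter beta are assumptions for illustration, not the authors' equations 6 and 7.

```python
# Softmax over a single dimension's outputs, yielding response probabilities.
# The outputs and beta below are hypothetical values for illustration.
import numpy as np

def softmax(x, beta=1.0):
    z = beta * np.asarray(x, dtype=float)
    z -= z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

dim_output = [1.2, 0.3, -0.5]    # hypothetical output of one dimension D
print(softmax(dim_output))       # response probabilities summing to 1
```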