2002
DOI: 10.1023/a:1013689704352

Finite-time Analysis of the Multiarmed Bandit Problem

Abstract: Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss due to the fact that the globally optimal policy is not followed all the time. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. …
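To make the regret notion from the abstract concrete, one standard way to write it (notation mine, not quoted from the paper): for a K-armed bandit with arm means μ_1, …, μ_K, best mean μ* = max_j μ_j, and T_j(n) plays of arm j among the first n plays, the expected regret is

```latex
% Expected regret after n plays: the shortfall from always playing the best arm.
% T_j(n) = number of times arm j is chosen during the first n plays.
\text{regret}(n) \;=\; \mu^{*} n \;-\; \sum_{j=1}^{K} \mu_j \, \mathbb{E}\bigl[T_j(n)\bigr]
```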

Cited by 3,869 publications (602 citation statements)
References 15 publications
“…This information manipulation is achieved by using four forced-choice trials, in which participants are told which option to pick, at the start of each game. We use these forced-choice trials to set up one of two information conditions: an unequal, or [1 3], condition, in which participants see 1 play from one option and 3 plays from the other option, and an equal, or [2 2], condition, in which participants see two outcomes from both options. By varying the amount of information participants have about each option independently of the mean payout of that option, this information manipulation allows us to remove the reward-information confound, at least on the first free-choice trial (Figure 2).…”
Section: Results (mentioning)
confidence: 99%
“…Once the payoffs of each option, R^i_t, have been estimated from the outcomes of the forced-choice trials, the model makes a decision using a simple logistic choice rule, where ΔR is the difference in expected reward between left and right options and ΔI is the difference in information between left and right options (which we define as +1 when left is more informative, −1 when right is more informative, and 0 when both options convey equal information in the [2 2] condition). The three free parameters of the decision process are: the information bonus, A, the spatial bias, B, and the decision noise, σ. We assume that these three decision parameters can take on different values in the different horizon and uncertainty conditions (with the proviso that A is undefined in the [2 2] information condition since ΔI = 0).…”
Section: Decision Component (mentioning)
confidence: 99%
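The choice rule itself is garbled in the excerpt above. A minimal sketch of one plausible form, assuming the probability of choosing the left option is a logistic function of ΔR + A·ΔI + B scaled by the decision noise σ (the exact sign conventions and parameterization are assumptions here, not quoted from the citing paper):

```python
import math

def p_choose_left(delta_R, delta_I, A, B, sigma):
    """Probability of choosing the left option under a logistic choice rule.

    delta_R : difference in estimated mean payoff, R_left - R_right
    delta_I : information difference (+1 left more informative, -1 right, 0 equal)
    A       : information bonus (weight on directed exploration)
    B       : spatial bias toward the left option
    sigma   : decision noise (larger sigma -> more random choices)
    """
    return 1.0 / (1.0 + math.exp(-(delta_R + A * delta_I + B) / sigma))

# Illustrative call: left looks slightly worse but is less well sampled,
# so the information bonus can still pull the choice toward it.
print(p_choose_left(delta_R=-2.0, delta_I=+1.0, A=5.0, B=0.0, sigma=4.0))
```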
“…In this case, it is known that optimal algorithms for the MBP, defined by Auer et al, have a regret proportional to log(N) [17]. The regret has no finite upper bound as N increases because the algorithm must keep playing the lower-reward machine to ensure that the probability of an incorrect judgment goes to zero.…”
Section: If We Define a New Variable S (mentioning)
confidence: 99%
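For reference, the logarithmic regret this excerpt alludes to is made precise by the finite-time bound proved for UCB1 in the cited paper; up to checking the exact constants against the original, it has the form

```latex
% Expected regret of UCB1 after n plays over K arms with means \mu_i and
% gaps \Delta_i = \mu^{*} - \mu_i (Auer, Cesa-Bianchi & Fischer, 2002).
\mathbb{E}\bigl[\text{regret}(n)\bigr] \;\le\;
  8 \sum_{i:\,\mu_i < \mu^{*}} \frac{\ln n}{\Delta_i}
  \;+\; \Bigl(1 + \frac{\pi^{2}}{3}\Bigr) \sum_{j=1}^{K} \Delta_j
```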
“…In fact, many application problems in diverse fields, such as communications (cognitive networks [7,8]), commerce (advertising on the web [9]), and entertainment (Monte Carlo tree search, which is used for computer games [10,11]), can be reduced to MBPs. In particular, the 'upper confidence bound 1 (UCB1) algorithm' for solving MBPs is used worldwide in many practical applications [17].…”
Section: Introduction (mentioning)
confidence: 99%
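Since UCB1 is the algorithm these citing papers keep returning to, here is a minimal sketch of its arm-selection rule; the bandit interface, the pull function, and the variable names are illustrative, not taken from the paper:

```python
import math
import random

def ucb1(pull, n_arms, n_rounds):
    """Play a stochastic bandit with the UCB1 index: mean + sqrt(2 ln t / n_i).

    pull(i) -> reward in [0, 1] for arm i (illustrative interface).
    Returns the empirical mean reward of each arm after n_rounds plays.
    """
    counts = [0] * n_arms          # times each arm has been played
    sums = [0.0] * n_arms          # total reward collected per arm

    # Play each arm once to initialise the estimates.
    for i in range(n_arms):
        sums[i] += pull(i)
        counts[i] += 1

    for t in range(n_arms, n_rounds):
        # Choose the arm maximising the upper confidence bound.
        def index(i):
            return sums[i] / counts[i] + math.sqrt(2.0 * math.log(t + 1) / counts[i])
        arm = max(range(n_arms), key=index)
        sums[arm] += pull(arm)
        counts[arm] += 1

    return [s / c for s, c in zip(sums, counts)]

# Example: two Bernoulli arms with success probabilities 0.4 and 0.6.
probs = [0.4, 0.6]
estimates = ucb1(lambda i: 1.0 if random.random() < probs[i] else 0.0,
                 n_arms=2, n_rounds=1000)
print(estimates)
```

The exploration bonus sqrt(2 ln t / n_i) keeps rarely played arms in contention at a logarithmically decaying rate, which is what yields the log(N) regret growth mentioned in the excerpt above.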