2012
DOI: 10.1103/physreve.85.041145
Dynamics of Boltzmann Q-learning in two-player two-action games

Abstract: We consider the dynamics of Q-learning in two-player two-action games with a Boltzmann exploration mechanism. For any nonzero exploration rate the dynamics is dissipative, which guarantees that agent strategies converge to rest points that are generally different from the game's Nash equilibria (NEs). We provide a comprehensive characterization of the rest point structure for different games and examine the sensitivity of this structure with respect to the noise due to exploration. Our results indicate that for…
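For a concrete picture of the setting the abstract describes, the sketch below simulates two independent Boltzmann Q-learners in a two-action game. The payoff matrix (a prisoner's dilemma), learning rate, and temperature are illustrative assumptions, not the parameters analysed in the paper; with a nonzero temperature the empirical strategies settle at interior rest points rather than at the pure Nash equilibrium.

```python
import numpy as np

# Illustrative prisoner's dilemma payoffs (an assumption, not from the paper).
# PAYOFF[i][a0, a1] is player i's payoff when player 0 plays a0 and player 1 plays a1.
PAYOFF = [np.array([[3.0, 0.0], [5.0, 1.0]]),   # player 0
          np.array([[3.0, 5.0], [0.0, 1.0]])]   # player 1

def boltzmann(q, temperature):
    """Softmax (Boltzmann) probabilities over a vector of Q-values."""
    z = np.exp((q - q.max()) / temperature)     # shift by max for numerical stability
    return z / z.sum()

rng = np.random.default_rng(0)
alpha, temperature = 0.05, 0.5                  # illustrative learning rate / exploration
Q = [np.zeros(2), np.zeros(2)]                  # stateless Q-values, one vector per player

for t in range(50_000):
    probs = [boltzmann(Q[i], temperature) for i in range(2)]
    acts = [rng.choice(2, p=probs[i]) for i in range(2)]
    for i in range(2):
        reward = PAYOFF[i][acts[0], acts[1]]
        # Stateless Q-learning: move the chosen action's value toward the received payoff.
        Q[i][acts[i]] += alpha * (reward - Q[i][acts[i]])

# Mixed strategies at the end of learning (a rest point of the noisy dynamics).
print([boltzmann(Q[i], temperature).round(3) for i in range(2)])
```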

Cited by 70 publications (85 citation statements)
References 26 publications
“…While equilibria are reached in some instances of the prisoner's dilemma or the coordination game [41–43], the behaviour fails to converge in others. The results depend on the reward structure of the situation [44] as well as the particular version of Q-learning [45]. …”
Section: Melioration Learning
confidence: 99%
“…The tradeoff between exploration and exploitation is generally a critical issue [10]. The most common method is to use a Boltzmann distribution, that is, the probability of choosing action a_{n,j_n} ∈ A_n at time t is given by … A policy γ_n = (γ_n(a_{n,1}), …, γ_n(a_{n,J_n})) is said to be stationary if it is not changing with time.…”
Section: P+pn
confidence: 99%
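The quoted passage refers to the standard Boltzmann (softmax) rule, in which the probability of an action grows exponentially with its Q-value, scaled by a temperature parameter. A minimal sketch, with the function name and temperature values chosen for illustration:

```python
import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    """p(a) = exp(Q(a)/tau) / sum_b exp(Q(b)/tau) -- the Boltzmann distribution."""
    q = np.asarray(q_values, dtype=float)
    z = np.exp((q - q.max()) / temperature)   # subtract the max for numerical stability
    return z / z.sum()

# Lower temperature concentrates probability on the currently best action;
# higher temperature makes the policy closer to uniform (more exploration).
print(boltzmann_policy([1.0, 0.5], temperature=1.0))   # ~[0.62, 0.38]
print(boltzmann_policy([1.0, 0.5], temperature=0.1))   # ~[0.99, 0.01]
```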
“…the number of iteration cycles. Based on previous experiments and recommendations [2,20,28], the values of the parameters are ε_0 = ε_max = 0.9, ε_min = 0 and β = 0.9. RL parameters and their descriptions can be found in [2,20].…”
Section: Q(prt) = Q(prt) + α ( C + γ max Q(p'r't) − Q(prt) )
confidence: 99%
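The excerpt only reports the constants ε_0 = ε_max = 0.9, ε_min = 0 and β = 0.9; the decay rule itself is not quoted. Purely as an illustrative assumption, one common way such constants are used is a geometric decay of the exploration parameter per episode:

```python
def exploration_rate(episode, eps_max=0.9, eps_min=0.0, beta=0.9):
    """Assumed geometric decay from eps_max toward eps_min; the cited work
    only lists the constants, so this schedule is illustrative."""
    return eps_min + (eps_max - eps_min) * beta ** episode

print([round(exploration_rate(t), 3) for t in range(5)])  # [0.9, 0.81, 0.729, 0.656, 0.59]
```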
“…The strategy used to select the action during learning is based on Boltzmann exploration [2,20,28]:…”
Section: Q(prt) = Q(prt) + α ( C + γ max Q(p'r't) − Q(prt) )
confidence: 99%
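Putting the two fragments together: the section heading is the tabular Q-learning update and the quoted sentence describes Boltzmann action selection. A compact sketch of how they combine follows; the state and action sets, the transition, and the parameter values are assumptions for illustration only:

```python
import numpy as np

def boltzmann_select(q_row, temperature, rng):
    """Sample an action from the Boltzmann distribution over one state's Q-values."""
    z = np.exp((q_row - q_row.max()) / temperature)
    return int(rng.choice(len(q_row), p=z / z.sum()))

def q_update(Q, s, a, c, s_next, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (c + gamma * max_a' Q(s',a') - Q(s,a)),
    mirroring the update written in the section heading."""
    Q[s, a] += alpha * (c + gamma * Q[s_next].max() - Q[s, a])

rng = np.random.default_rng(1)
Q = np.zeros((3, 2))                       # 3 states x 2 actions, made up for the example
s = 0
for step in range(200):
    a = boltzmann_select(Q[s], temperature=0.5, rng=rng)
    s_next = int(rng.integers(3))          # placeholder transition dynamics
    c = float(rng.normal())                # placeholder immediate reward/cost C
    q_update(Q, s, a, c, s_next)
    s = s_next
```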