2021
DOI: 10.48550/arxiv.2103.01312
Preprint

UCB Momentum Q-learning: Correcting the bias without forgetting

Abstract: We propose UCBMQ, Upper Confidence Bound Momentum Q-learning, a new algorithm for reinforcement learning in tabular and possibly stage-dependent, episodic Markov decision processes. UCBMQ is based on Q-learning, to which we add a momentum term, and relies on the principle of optimism in the face of uncertainty to deal with exploration. The new technical ingredient of UCBMQ is the use of momentum to correct the bias that Q-learning suffers while, at the same time, limiting its impact on the second-order term of th…

Cited by 5 publications (9 citation statements)
References 5 publications
“…The runtime of Q-EarlySettled-Advantage is no larger than O(T), which is proportional to the time taken to read the samples. This matches the computational cost of the model-free algorithm UCB-Q proposed in Jin et al (2018a), and is considerably lower than that of the UCB-M-Q algorithm in Menard et al (2021) (which has a computational cost of at least O(ST)).…”
Section: Results (supporting)
confidence: 85%
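For rough intuition on the O(T) versus O(ST) gap described in the statement above, here is a back-of-the-envelope sketch; the sizes S and T are hypothetical and not taken from either paper:

```python
# Hypothetical sizes: S states, T total samples (illustrative only).
S, T = 50, 1_000

# An O(T) algorithm does constant work per sample read; an algorithm that
# performs a full sweep over the S states for every sample does O(S*T) work,
# matching the "at least O(ST)" cost attributed to UCB-M-Q above.
ops_constant = T * 1         # one O(1) update per sample -> O(T)
ops_state_sweep = T * S      # a per-sample sweep over states -> O(S*T)

print(ops_constant, ops_state_sweep)  # → 1000 50000
```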
“…This is basically un-improvable for the tabular case, since even storing the optimal Q-values alone takes O(SAH) units of space. In comparison, while Menard et al (2021) also accommodates the sample size range (17), the algorithm proposed therein incurs a space complexity of O(S²AH) that is S times higher than ours.…”
Section: Results (mentioning)
confidence: 92%
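To see why O(SAH) space is essentially unavoidable, note that a tabular Q-table already holds one entry per (step, state, action) triple. A minimal sketch, with hypothetical dimensions chosen only for illustration:

```python
import numpy as np

# Hypothetical tabular MDP sizes: S states, A actions, horizon H.
S, A, H = 100, 10, 20

# Storing one Q-value per (h, s, a) triple costs O(SAH) memory on its own,
# which is why the O(SAH) space bound quoted above is essentially minimal.
Q = np.zeros((H, S, A))
print(Q.size)  # → 20000, i.e. S * A * H entries
```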
“…Our work is also closely related to another line of work on value-based methods. In particular, Azar et al (2017); Zanette and Brunskill (2019); Zhang et al (2020a,b); Menard et al (2021) have shown that value-based methods can achieve an O(√(SAH³K)) regret upper bound, which matches the information-theoretic limit. Different from these works, we are the first to prove the (nearly) optimal regret bound for policy-based methods.…”
Section: Related Work (mentioning)
confidence: 78%