2021
DOI: 10.48550/arxiv.2103.01312
Preprint

UCB Momentum Q-learning: Correcting the bias without forgetting

Abstract: We propose UCBMQ, Upper Confidence Bound Momentum Q-learning, a new algorithm for reinforcement learning in tabular and possibly stage-dependent, episodic Markov decision processes. UCBMQ is based on Q-learning, to which we add a momentum term, and relies on the principle of optimism in the face of uncertainty to deal with exploration. The new technical ingredient of UCBMQ is the use of momentum to correct the bias that Q-learning suffers while, at the same time, limiting its impact on the second-order term of th…

Cited by 5 publications (9 citation statements)
References 5 publications
“…The runtime of Q-EarlySettled-Advantage is no larger than O(T), which is proportional to the time taken to read the samples. This matches the computational cost of the model-free algorithm UCB-Q proposed in Jin et al (2018a), and is considerably lower than that of the UCB-M-Q algorithm in Menard et al (2021) (which has a computational cost of at least O(ST)).…”
Section: Results (supporting)
confidence: 85%
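For rough intuition on the O(T) versus O(ST) gap described in the statement above, here is a back-of-the-envelope sketch; the sizes S and T are hypothetical and not taken from either paper:

```python
# Hypothetical sizes: S states, T total samples (illustrative only).
S, T = 50, 1_000

# An O(T) algorithm does constant work per sample read; an algorithm that
# performs a full sweep over the S states for every sample does O(S*T) work,
# matching the "at least O(ST)" cost attributed to UCB-M-Q above.
ops_constant = T * 1         # one O(1) update per sample -> O(T)
ops_state_sweep = T * S      # a per-sample sweep over states -> O(S*T)

print(ops_constant, ops_state_sweep)  # → 1000 50000
```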
“…This is basically un-improvable for the tabular case, since even storing the optimal Q-values alone takes O(SAH) units of space. In comparison, while Menard et al (2021) also accommodates the sample size range (17), the algorithm proposed therein incurs a space complexity of O(S²AH) that is S times higher than ours.…”
Section: Results (mentioning)
confidence: 92%
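To see why O(SAH) space is essentially unavoidable, note that a tabular Q-table already holds one entry per (step, state, action) triple. A minimal sketch, with hypothetical dimensions chosen only for illustration:

```python
import numpy as np

# Hypothetical tabular MDP sizes: S states, A actions, horizon H.
S, A, H = 100, 10, 20

# Storing one Q-value per (h, s, a) triple costs O(SAH) memory on its own,
# which is why the O(SAH) space bound quoted above is essentially minimal.
Q = np.zeros((H, S, A))
print(Q.size)  # → 20000, i.e. S * A * H entries
```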
“…Our work is also closely related to another line of work on value-based methods. In particular, Azar et al (2017); Zanette and Brunskill (2019); Zhang et al (2020a,b); Menard et al (2021) have shown that value-based methods can achieve an O(√(SAH³K)) regret upper bound, which matches the information-theoretic limit. Different from these works, we are the first to prove the (nearly) optimal regret bound for policy-based methods.…”
Section: Related Work (mentioning)
confidence: 78%