2020
DOI: 10.48550/arxiv.2007.11849
Preprint

Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation

Abstract: We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation. Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal O(√T) regret and another computationally efficient variant with O(T^{3/4}) regret, where T is the number of interactions. Next, taking inspiration from adversarial linear bandits, we develop yet another effici…
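For context, the bounds quoted above are with respect to the standard infinite-horizon average-reward notion of regret; the truncated abstract does not spell it out, so the following display is an assumption based on the usual definition in this literature, with J* the optimal long-run average reward and r_t the reward collected at step t:

\[
\mathrm{Reg}(T) \;=\; T\, J^{*} \;-\; \sum_{t=1}^{T} r_t,
\qquad
J^{*} \;=\; \max_{\pi}\, \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi}\!\left[ \sum_{t=1}^{T} r_t \right].
\]

Under this convention, an O(√T) bound means the per-step average reward approaches J* at rate O(1/√T), while O(T^{3/4}) converges more slowly.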

Cited by 4 publications (14 citation statements)
References 8 publications (18 reference statements)
“…In this section, we show that the estimation error condition in Assumption 5.2 (and thus O(√T) regret) can be achieved under similar assumptions as in Abbasi-Yadkori et al (2019) and Wei et al (2020a), which we state next. Assumption 6.1 (Linear value functions).…”
Section: Linear Value Functions (supporting)
confidence: 52%
“…Hao et al (2020) improve these results to O(T^{2/3}). More recently, Wei et al (2020a) present three algorithms for average-reward MDPs with linear function approximation. Among these, FOPO achieves O(√T) regret but is computationally inefficient, and OLSVI.FH is efficient but obtains O(T^{3/4}) regret.…”
Section: Related Work (mentioning)
confidence: 99%
“…To overcome the curse of large state space, function approximation has been used to design practically successful algorithms (Singh et al, 1995; Mnih et al, 2015; Bertsekas, 2018). However, most existing studies on learning infinite-horizon average-reward MDPs are limited to tabular MDPs, with only a few exceptions (Abbasi-Yadkori et al, 2019a,b; Hao et al, 2020; Wei et al, 2020). More specifically, Abbasi-Yadkori et al (2019a,b); Hao et al (2020) studied RL with function approximation for infinite-horizon average-reward MDPs under strong assumptions such as uniformly-mixing and uniformly excited feature, and proved sublinear regrets.…”
Section: Introduction (mentioning)
confidence: 99%