2021
DOI: 10.48550/arxiv.2106.14352
Preprint

Instance-optimality in optimal value estimation: Adaptivity via variance-reduced Q-learning

Abstract: Various algorithms in reinforcement learning exhibit dramatic variability in their convergence rates and ultimate accuracy as a function of the problem structure. Such instance-specific behavior is not captured by existing global minimax bounds, which are worst-case in nature. We analyze the problem of estimating optimal Q-value functions for a discounted Markov decision process with discrete states and actions and identify an instance-dependent functional that controls the difficulty of estimation in the ℓ∞-norm…
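
Below is a minimal sketch of the variance-reduced Q-learning scheme the title refers to, in the synchronous (generative-model) setting. The simulator interface `sample_next_states`, the epoch counts, and the step-size schedule are illustrative assumptions, not the paper's exact algorithm:

```python
import numpy as np

def empirical_bellman(Q, R, next_states, gamma):
    """One-sample empirical Bellman optimality operator:
    T_hat(Q)(s, a) = r(s, a) + gamma * max_a' Q(s'_{s,a}, a')."""
    return R + gamma * Q[next_states].max(axis=-1)

def variance_reduced_q_learning(sample_next_states, R, gamma,
                                n_epochs=5, recenter_samples=1000,
                                inner_iters=1000):
    """Hedged sketch: sample_next_states(k) is assumed to return an int
    array of shape (k, S, A) holding k i.i.d. sampled next states for
    every (state, action) pair; R is the (S, A) reward matrix."""
    Q_bar = np.zeros_like(R)  # reference point, refined once per epoch
    for _ in range(n_epochs):
        # Monte Carlo estimate of the Bellman operator at the reference
        # point, averaged over recenter_samples independent draws.
        batch = sample_next_states(recenter_samples)
        T_tilde = np.mean(
            [empirical_bellman(Q_bar, R, ns, gamma) for ns in batch], axis=0)
        Q = Q_bar.copy()
        for k in range(1, inner_iters + 1):
            lam = 1.0 / (1.0 + (1.0 - gamma) * k)  # rescaled linear step size
            ns = sample_next_states(1)[0]          # one fresh sample per step
            # Recentered update: T_hat(Q) - T_hat(Q_bar) has small variance
            # when Q is close to Q_bar, and T_tilde anchors the recursion.
            Q = (1 - lam) * Q + lam * (
                empirical_bellman(Q, R, ns, gamma)
                - empirical_bellman(Q_bar, R, ns, gamma)
                + T_tilde)
        Q_bar = Q  # next epoch recenters at the improved estimate
    return Q_bar
```

The recentering step is what makes instance-dependent rates plausible: the stochastic term driving each update has variance tied to the local deviation from the reference point Q_bar rather than to a worst-case bound.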

Cited by 8 publications (42 citation statements); references 5 publications.
“…, this matches the two-point lower bound in the paper [KXWJ21] (in the discounted MDP case). When specializing to the cases where the optimal policy is unique, or satisfies the Lipschitz-type assumptions in the paper [KXWJ21], the upper bound above also recovers the leading-order term in that paper. We conjecture that the leading-order term of the solution s_n to the fixed-point equation is actually optimal for large n. It is an important direction of future work to investigate this gap, and establish optimality results under suitably defined problem classes.…”
Section: Guarantees for Stochastic Shortest Path (citation type: mentioning)
confidence: 55%
“…Note that when specialized to γ-discounted MDPs, the sample size requirement in Corollary 5 becomes O((1 − γ)^{-4}), which can be worse than the corresponding requirements in the paper [KXWJ21], at least in certain regimes. Intuitively, this is the price we pay when moving to the general case where only the contraction of the population-level operator is assumed, instead of the sample-level contraction.…”
Section: Guarantees for Stochastic Shortest Path (citation type: mentioning)
confidence: 89%
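
To make the comparison in this quote concrete, here is a hedged side-by-side of the ℓ∞ sample-size scalings being contrasted (constants and logarithmic factors omitted; the (1 − γ)^{-3} worst-case scaling of variance-reduced Q-learning is standard background, not a claim made on this page):

```latex
\[
\underbrace{\;n \,\gtrsim\, \frac{1}{(1-\gamma)^{4}\,\varepsilon^{2}}\;}_{\text{contraction-only analysis (Corollary 5)}}
\qquad \text{vs.} \qquad
\underbrace{\;n \,\gtrsim\, \frac{1}{(1-\gamma)^{3}\,\varepsilon^{2}}\;}_{\text{worst-case rate of variance-reduced Q-learning}}
\]
```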