Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model

Azar, Mohammad Gheshlaghi; Munos, Rémi; Kappen, Hilbert J.

doi:10.1007/s10994-013-5368-1

Cited by 119 publications

(184 citation statements)

References 12 publications

(10 reference statements)

Supporting

Mentioning

171

Contrasting


“…It is interesting to compare the upper bound (27) with a worst case upper bound on $\Sigma_{\mathrm{PolOpt}}(r, \mathbf{P}, \gamma)$. In light of Lemma 7 from the paper [AMK13], assuming $R_{\max} \le 1$, and $\|r\|_\infty \le 1$ for simplicity, we have $\Sigma_{\mathrm{PolOpt}}(r, \mathbf{P}, \gamma) \le \frac{1}{(1-\gamma)^{1.5}}$.…”

Section: A Conservative Yet Useful Upper Bound (mentioning)

confidence: 98%

“…Much of the focus in the past has been on understanding TD-type algorithms with instance-dependent analyses: function approximation under the $\ell_2$ error [BRS18, DBGM18, XWZL20], tabular setting under the $\ell_\infty$-error [KXWJ21, PW21], or under kernel function approximation [DWW21]. Many of these results established instance-specific guarantees that improve upon global worst-case bounds [AMK13]. In particular, the paper [KPR+21] establishes a local minimax lower-bound in the tabular setting and proposes a procedure that achieves it.…”

Section: Related Work (mentioning)

confidence: 99%

“…There exists a variety of different techniques for solving policy optimization; the one of interest here is Q-learning, introduced in the paper [WD92]. There has been much prior work on the theory of Q-learning, such as convergence of the standard updates [Wai19a, LCC+21], global minimax lower bounds for estimation of optimal Q-functions [AMK13], variance-reduced versions of Q-learning and their worst-case optimality [SWWY18, SWW+18, Wai19b], and the asynchronous setting [LWC+20]. The paper [KXWJ21] establishes the local non-asymptotic minimax lower bound of estimating the Q-function and proves that variance-reduced Q-learning achieves it.…”

Section: Related Work (mentioning)

confidence: 99%


Instance-Dependent Confidence and Early Stopping for Reinforcement Learning

Khamaru, Xia, Wainwright et al. 2022

Preprint

Various algorithms for reinforcement learning (RL) exhibit dramatic variation in their convergence rates as a function of problem structure. Such problem-dependent behavior is not captured by worst-case analyses and has accordingly inspired a growing effort in obtaining instance-dependent guarantees and deriving instance-optimal algorithms for RL problems. This research has been carried out, however, primarily within the confines of theory, providing guarantees that explain ex post the performance differences observed. A natural next step is to convert these theoretical guarantees into guidelines that are useful in practice. We address the problem of obtaining sharp instance-dependent confidence regions for the policy evaluation problem and the optimal value estimation problem of an MDP, given access to an instance-optimal algorithm. As a consequence, we propose a data-dependent stopping rule for instance-optimal algorithms. The proposed stopping rule adapts to the instance-specific difficulty of the problem and allows for early termination for problems with favorable structure.


“…In 1981, Chile privatized its traditional public pension system and created a fully funded (FF) private pension system (PPS) with pension fund managers, Administradoras de Fondos de Pensiones (AFP), that were exclusively designated to invest workers' retirement savings (Holzmann & Stiglitz, 2001; Piñera, 1991). The Chilean reform shut down the national public pension system (NPS) and diverged from the International Labour Organization's (ILO) social protection principles (ILO, 1952, 2012a, 2012b). The impetus for similar reforms in other Latin American countries was supported by the World Bank (WB) and the Inter-American Development Bank (IADB), in efforts to reduce the public pension debt due to aging and to address the sustainability of pay-as-you-go (PAYG) (World Bank, 1994).…”

Section: Structural Reforms and Re-reforms In Latin America (mentioning)

confidence: 99%

“…In Latin America, 13 countries, including Peru, have adopted social pensions to mitigate poverty in old age (ECLAC, 2019). Overall, these transfers help pay living expenses, but they do not ensure the wellness of older people (ILO, 2014; Olivera, 2016b; Rofman, Apella, & Vezza, 2014). Such individuals' needs go beyond income and include access to health care, optimal housing, transportation, and other services (UN, 2002).…”

Section: Structural Reforms and Re-reforms In Latin America (mentioning)

confidence: 99%

The pension system in Peru: Parallels and intersections

Saco

Gil

2020

Int J Soc Welfare


In this article, we estimate the active and passive contributory pension coverage rates in Peru since the structural reform of social security in 1992. Further, we delineate a supply‐and‐demand model for the pension market. Using a diagram based on this model, we analyze the impact of re‐reforms that have increased the competitiveness of the pension system in an effort to promote worker affiliation. Re‐reform has shifted the supply of pension services, but the demand for such services has remained constant at 28% of the labor force. Thus, coverage has not meaningfully increased. In the conclusion section of the article, we consider various policy interventions intended to increase demand, which might include instituting mandatory contributions for the self‐employed and/or increasing the proportion of registered employees. In addition, to mitigate income insecurity for older people, given the low active and passive coverage rates, we recommend that access to social pensions be increased.

Key Practitioner Message: • Estimation of the active and the passive pension coverage trajectories since the social security structural reform. • Delineation of a conceptual framework to describe the pension market using supply and demand functions. • Alternative interventions to increase coverage, that is, strengthening the competitiveness of the pension market, shifting the demand and universal social pensions.


Variance reduced value iteration and faster algorithms for solving Markov decision processes

Sidford

Wang

et al. 2021

Naval Research Logistics


In this paper we provide faster algorithms for approximately solving discounted Markov decision processes in multiple parameter regimes. Given a discounted Markov decision process (DMDP) with $|S|$ states, $|A|$ actions, discount factor $\gamma \in (0, 1)$, and rewards in the range $[-M, M]$, we show how to compute an $\epsilon$‐optimal policy, with probability $1 - \delta$, in time $\tilde{O}\left(\left(|S|^2 |A| + \frac{|S||A|}{(1-\gamma)^3}\right)\log\left(\frac{M}{\epsilon}\right)\log\left(\frac{1}{\delta}\right)\right)$. (Note: We use $\tilde{O}$ to hide polylogarithmic factors in the input parameters, that is, $\tilde{O}(f(x)) = O\left(f(x) \cdot \log(f(x))^{O(1)}\right)$.) This contribution reflects the first nearly linear time, nearly linearly convergent algorithm for solving DMDPs for intermediate values of $\gamma$. We also show how to obtain improved sublinear time algorithms provided we can sample from the transition function in $O(1)$ time. Under this assumption we provide an algorithm which computes an $\epsilon$‐optimal policy for $\epsilon \in \left(0, \frac{M}{\sqrt{1-\gamma}}\right]$ with probability $1 - \delta$ in time $\tilde{O}\left(\frac{|S||A| M^2}{(1-\gamma)^4 \epsilon^2}\log\left(\frac{1}{\delta}\right)\right)$. Furthermore, we extend both these algorithms to solve finite horizon MDPs. Our algorithms improve upon the previous best for approximately computing optimal policies for fixed‐horizon MDPs in multiple parameter regimes. Interestingly, we obtain our results by a careful modification of approximate value iteration. We show how to combine classic approximate value iteration analysis with new techniques in variance reduction. Our fastest algorithms leverage further insights to ensure that our algorithms make monotonic progress towards the optimal value. This paper is one of few instances in using sampling to obtain a linearly convergent linear programming algorithm and we hope that the analysis may be useful more broadly.
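To get a feel for how the sublinear-time bound in this abstract scales, the following minimal Python sketch evaluates only its leading term, |S||A| M² / ((1 − γ)⁴ ε²) · log(1/δ). The function name and argument names are hypothetical, and the constants and polylogarithmic factors hidden by the Õ notation are dropped, so the result indicates relative growth rather than an actual running time.

```python
import math


def sublinear_leading_term(num_states, num_actions, reward_bound,
                           gamma, epsilon, delta):
    """Leading term of the sublinear-time bound quoted in the abstract.

    Hypothetical helper: constants and polylog factors hidden by the
    O-tilde notation are ignored, so only ratios between calls are meaningful.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    # The abstract states the bound for epsilon in (0, M / sqrt(1 - gamma)].
    eps_max = reward_bound / math.sqrt(1.0 - gamma)
    if not 0.0 < epsilon <= eps_max:
        raise ValueError("epsilon must lie in (0, M / sqrt(1 - gamma)]")
    horizon_factor = (1.0 - gamma) ** 4
    return (num_states * num_actions * reward_bound ** 2
            / (horizon_factor * epsilon ** 2)
            * math.log(1.0 / delta))


# Halving (1 - gamma) multiplies the leading term by 2**4 = 16.
base = sublinear_leading_term(10, 2, 1.0, 0.5, 0.5, 0.1)
harder = sublinear_leading_term(10, 2, 1.0, 0.75, 0.5, 0.1)
print(harder / base)
```

The fourth-power dependence on 1/(1 − γ) is the dominant effect: moving the discount factor closer to 1 inflates the term far faster than adding states or actions, which enter only linearly.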
