András Antos scite author profile

András Antos, Csaba Szepesvári, Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning Journal, Springer, 2008, 71, pp. 89-129.

Abstract: We consider the problem of finding a near-optimal policy using value-function methods in continuous-space, discounted Markovian Decision Problems (MDPs) when only a single trajectory underlying some policy can be used as the input. Since the state space is continuous, one must resort to function approximation. In this paper we study a policy iteration algorithm that iterates over action-value functions, where the iterates are obtained by empirical risk minimization with a loss function that penalizes high magnitudes of the Bellman residual. It turns out that when a linear parameterization is used, the algorithm is equivalent to least-squares policy iteration. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set, and the controllability properties of the MDP. To the best of our knowledge, this is the first theoretical result for off-policy control learning over continuous state spaces using a single trajectory.

Convergence properties of functional estimates for discrete distributions

Antos

Kontoyiannis

2001

Random Struct Algorithms

198

178


Suppose P is an arbitrary discrete distribution on a countable alphabet. Given an i.i.d. sample X_1, ..., X_n drawn from P, we consider the problem of estimating the entropy H(P) or some other functional F = F(P) of the unknown distribution P. We show that, for additive functionals satisfying mild conditions (including the cases of the mean, the entropy, and mutual information), the plug-in estimates of F are universally consistent. We also prove that, without further assumptions, no rate-of-convergence results can be obtained for any sequence of estimators. In the case of entropy estimation, under a variety of different assumptions, we get rate-of-convergence results for the plug-in estimate and for a nonparametric estimator based on match-lengths. The behavior of the variance and the expected error of the plug-in estimate is shown to be in sharp contrast to the finite-alphabet case. A number of other important examples of functionals are also treated in some detail.
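As an illustration of what a plug-in estimate of entropy looks like, here is a minimal sketch in Python. It assumes the standard definition (replace P by the empirical distribution of the sample and evaluate the entropy of that), with the natural logarithm; the function name plugin_entropy is hypothetical and is not taken from the paper.

```python
import math
from collections import Counter


def plugin_entropy(sample):
    """Plug-in entropy estimate (in nats) of an i.i.d. sample.

    Assumption: the estimate is the entropy of the empirical
    distribution, i.e. -sum_x p_hat(x) * log(p_hat(x)), where p_hat(x)
    is the relative frequency of symbol x in the sample.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)  # symbol -> number of occurrences
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

For example, plugin_entropy(["a", "b", "a", "b"]) evaluates the entropy of the empirical distribution (1/2, 1/2), which is log 2 nats. Only symbols that actually occur in the sample contribute, so the sketch works unchanged for a countably infinite alphabet.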

Online Markov Decision Processes Under Bandit Feedback

Neu

György

Szepesvári

et al. 2014

IEEE Trans. Automat. Contr.

137

110


We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in hindsight in terms of the total reward received. Specifically, in each time step the agent observes the current state and the reward associated with the last transition; however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is an algorithm with an expected regret of O(T^{2/3} ln T). In this paper, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of this algorithm (more precisely, a slightly modified version thereof) is O(T^{1/2} ln T), giving the first rigorously proven, essentially tight regret bound for the problem.

Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path

Antos

Szepesvári

Munos

2006

