We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator online during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with probability one), a characterization of the limit of convergence, and a bound on the resulting approximation error. Furthermore, our analysis is based on a new line of reasoning that provides new intuition about the dynamics of temporal-difference learning. In addition to proving new and stronger positive results than those previously available, we identify the significance of online updating and potential hazards associated with the use of nonlinear function approximators. First, we prove that divergence may occur when updates are not based on trajectories of the Markov chain. This fact reconciles positive and negative results that have been discussed in the literature, regarding the soundness of temporal-difference learning. Second, we present an example illustrating the possibility of divergence when temporal-difference learning is used in the presence of a nonlinear function approximator.
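As a concrete illustration of the algorithm analyzed here, the following is a minimal Python sketch of TD(0) with a linear approximator, updated online along a single trajectory. The environment interface (`phi`, `sample_transition`) and the particular step-size schedule are assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np

def td0_linear(phi, sample_transition, x0, gamma=0.95, alpha0=0.1, steps=100_000):
    """Minimal TD(0) with a linear approximator J(x) ~ phi(x) @ theta.

    phi: maps a state to a feature vector (np.ndarray of shape (K,)).
    sample_transition: maps a state to (next_state, reward), simulating one
        step of the Markov chain (updates happen online along one trajectory).
    """
    theta = np.zeros(phi(x0).shape[0])
    x = x0
    for t in range(steps):
        x_next, reward = sample_transition(x)
        # Temporal-difference error for the discounted cost-to-go.
        delta = reward + gamma * phi(x_next) @ theta - phi(x) @ theta
        # Diminishing step size, as stochastic-approximation convergence requires.
        alpha = alpha0 / (1 + t / 1000)
        theta += alpha * delta * phi(x)
        x = x_next
    return theta
```

The diminishing step size reflects the standard conditions under which almost-sure convergence results of this kind apply; sampling states off-trajectory is precisely the regime in which the paper shows divergence can occur.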
The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large-scale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach "fits" a linear combination of pre-selected basis functions to the dynamic programming cost-to-go function. We develop error bounds that offer performance guarantees and also guide the selection of both basis functions and "state-relevance weights" that influence quality of the approximation. Experimental results in the domain of queueing network control provide empirical support for the methodology. Subject classifications: dynamic programming/optimal control: approximations/large-scale problems; queues, algorithms: control of queueing networks. Since any feasible weight vector $r$ of the approximate LP satisfies $\Phi r \le J^*$ componentwise,
$$\|J^* - \Phi r\|_{1,c} = \sum_{x \in S} c(x)\big(J^*(x) - (\Phi r)(x)\big) = c^\top J^* - c^\top \Phi r,$$
and maximizing $c^\top \Phi r$ is therefore equivalent to minimizing $\|J^* - \Phi r\|_{1,c}$.
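In code, the approximate LP is a small linear program: maximize $c^\top \Phi r$ subject to $\Phi r \le g_a + \alpha P_a \Phi r$ for every action $a$. Below is a minimal sketch for a finite MDP using `scipy.optimize.linprog`; the array-shape conventions are assumptions made for this illustration.

```python
import numpy as np
from scipy.optimize import linprog

def solve_alp(g, P, Phi, c, alpha=0.95):
    """Approximate LP: max c^T Phi r  s.t.  Phi r <= g_a + alpha * P_a Phi r.

    g:   (A, S) per-action cost vectors.
    P:   (A, S, S) per-action transition matrices.
    Phi: (S, K) basis-function matrix.
    c:   (S,) state-relevance weights.
    """
    A, S = g.shape
    # One block of S constraints per action: (Phi - alpha * P_a Phi) r <= g_a.
    A_ub = np.vstack([Phi - alpha * P[a] @ Phi for a in range(A)])
    b_ub = np.concatenate([g[a] for a in range(A)])
    # linprog minimizes, so negate the objective c^T Phi r.
    res = linprog(-(Phi.T @ c), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * Phi.shape[1])
    return res.x  # weight vector r
```

Note that the program has only K variables but one constraint per state-action pair, which is exactly the bottleneck that constraint sampling (discussed in a later abstract) addresses.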
This paper considers the use of a simple posterior sampling algorithm to balance between exploration and exploitation when learning to optimize actions, such as in multiarmed bandit problems. The algorithm, also known as Thompson Sampling and as probability matching, offers significant advantages over the popular upper confidence bound (UCB) approach, and can be applied to problems with finite or infinite action spaces and complicated relationships among action rewards. We make two theoretical contributions. The first establishes a connection between posterior sampling and UCB algorithms. This result lets us convert regret bounds developed for UCB algorithms into Bayesian regret bounds for posterior sampling. Our second theoretical contribution is a Bayesian regret bound for posterior sampling that applies broadly and can be specialized to many model classes. This bound depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to Bayesian regret bounds previously derived for UCB algorithms on specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. Further, our analysis provides insight into performance advantages of posterior sampling, which are highlighted through simulation results that demonstrate performance surpassing recently proposed UCB algorithms.
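For intuition, here is a minimal sketch of posterior sampling in the simplest setting, a Bernoulli bandit with independent Beta(1,1) priors; the `pull` interface and the prior choice are assumptions made for this illustration, not the paper's general model class.

```python
import numpy as np

def thompson_bernoulli(pull, n_arms, horizon=10_000, seed=0):
    """Posterior (Thompson) sampling for a Bernoulli bandit, Beta(1,1) priors.

    pull: function mapping an arm index to a 0/1 reward.
    """
    rng = np.random.default_rng(seed)
    successes = np.ones(n_arms)  # Beta posterior parameters (alpha)
    failures = np.ones(n_arms)   # Beta posterior parameters (beta)
    for _ in range(horizon):
        # Sample a mean-reward vector from the posterior and act greedily on it:
        # each arm is selected with exactly its posterior probability of being optimal.
        sampled_means = rng.beta(successes, failures)
        arm = int(np.argmax(sampled_means))
        reward = pull(arm)
        successes[arm] += reward
        failures[arm] += 1 - reward
    return successes, failures
```

The "probability matching" interpretation is visible in the sampling step: randomizing over posterior draws makes the selection probability of each arm equal to its posterior probability of being optimal.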
We introduce and analyze a simulation-based approximate dynamic programming method for pricing complex American-style options, with a possibly high-dimensional underlying state space. We work within a finitely parameterized family of approximate value functions, and introduce a variant of value iteration, adapted to this parametric setting. We also introduce a related method that uses a single (parameterized) value function defined on time-state pairs, as opposed to a separate (independently parameterized) value function for each time. Our methods involve the evaluation of value functions at a finite set, consisting of "representative" elements of the state space. We show that with an arbitrary choice of this set, the approximation error can grow exponentially with the time horizon (time to expiration). On the other hand, if representative states are chosen by simulating the state process using the underlying risk-neutral probability distribution, then the approximation error remains bounded.
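A hedged sketch of this style of regression-based backward induction follows, with representative states obtained by simulating paths under the risk-neutral measure. The interfaces (`payoff`, `basis`) and the use of ordinary least squares are illustrative assumptions; the paper's exact estimator differs in its details.

```python
import numpy as np

def fitted_value_iteration(paths, payoff, basis, discount):
    """Regression-based backward induction for an American/Bermudan option.

    paths:    (n_paths, n_periods, d) states simulated under the
              risk-neutral measure (the "representative" states).
    payoff:   maps an (n, d) array of states to (n,) exercise values.
    basis:    maps an (n, d) array of states to an (n, K) feature matrix.
    discount: one-period risk-neutral discount factor.
    """
    n_paths, n_periods, _ = paths.shape
    value = payoff(paths[:, -1])  # at expiration, value equals exercise value
    weights = []
    for t in range(n_periods - 2, -1, -1):
        X = basis(paths[:, t])
        # Least-squares fit of the discounted next-period value (continuation).
        w, *_ = np.linalg.lstsq(X, discount * value, rcond=None)
        continuation = X @ w
        value = np.maximum(payoff(paths[:, t]), continuation)
        weights.append(w)
    return value.mean(), weights[::-1]
```

Sampling the regression points from the risk-neutral state process, rather than choosing them arbitrarily, is exactly the condition under which the paper shows the approximation error stays bounded in the horizon.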
The authors develop a theory characterizing optimal stopping times for discrete-time ergodic Markov processes with discounted rewards. The theory differs from prior work by its view of per-stage and terminal reward functions as elements of a certain Hilbert space. In addition to a streamlined analysis establishing existence and uniqueness of a solution to Bellman's equation, this approach provides an elegant framework for the study of approximate solutions. In particular, the authors propose a stochastic approximation algorithm that tunes weights of a linear combination of basis functions in order to approximate a value function. They prove that this algorithm converges (almost surely) and that the limit of convergence has some desirable properties. The utility of the approximation method is illustrated via a computational case study involving the pricing of a path-dependent financial derivative security that gives rise to an optimal stopping problem with a 100-dimensional state space.
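One common form of such a stochastic approximation update, sketched below, tunes weights of a linear approximation to the continuation value along a single simulated trajectory; the interfaces and step-size schedule are assumptions made for this sketch rather than the paper's precise algorithm.

```python
import numpy as np

def stopping_sa(phi, G, sample_transition, x0, alpha=0.99, steps=200_000):
    """Stochastic approximation for optimal stopping (hedged sketch).

    Approximates the continuation value Q(x) ~ phi(x) @ r, where Q solves
    Q(x) = alpha * E[max(G(x'), Q(x')) | x]; stopping is then taken to be
    profitable whenever the exercise reward G(x) exceeds phi(x) @ r.

    phi: state -> feature vector; G: state -> exercise (terminal) reward;
    sample_transition: state -> next state, simulating the ergodic chain.
    """
    r = np.zeros(phi(x0).shape[0])
    x = x0
    for t in range(steps):
        x_next = sample_transition(x)
        # One-sample estimate of alpha * max(G(x'), Q(x')).
        target = alpha * max(G(x_next), phi(x_next) @ r)
        gamma_t = 1.0 / (1 + t / 1000)  # diminishing step size
        r += gamma_t * (target - phi(x) @ r) * phi(x)
        x = x_next
    return r
```

Unlike the batch regression sketch above, this version updates the weight vector online, one transition at a time, which is the regime the paper's almost-sure convergence result addresses.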
We propose information-directed sampling, a new approach to online optimization problems in which a decision maker must balance between exploration and exploitation while learning from partial feedback. Each action is sampled in a manner that minimizes the ratio between squared expected single-period regret and a measure of information gain: the mutual information between the optimal action and the next observation. We establish an expected regret bound for information-directed sampling that applies across a very general class of models and scales with the entropy of the optimal action distribution. We illustrate through simple analytic examples how information-directed sampling accounts for kinds of information that alternative approaches do not adequately address, and how this can lead to dramatic performance gains. For the widely studied Bernoulli, Gaussian, and linear bandit problems, we demonstrate state-of-the-art simulation performance.
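Because the minimizing distribution is known to randomize over at most two actions, the per-period optimization can be sketched as a search over action pairs and mixing weights. The inputs `delta` and `gain` (estimated expected regrets and information gains per action) are assumed to be supplied by the surrounding learning algorithm; computing them is the model-specific part of the method.

```python
import numpy as np

def ids_distribution(delta, gain, grid=1000):
    """Choose an action distribution minimizing the information ratio
    (expected regret)^2 / (expected information gain).

    delta: (n,) estimated expected single-period regret per action.
    gain:  (n,) estimated information gain per action, e.g. mutual
           information between the optimal action and the next observation.
    """
    n = len(delta)
    q = np.linspace(0.0, 1.0, grid)
    best = (np.inf, (0, 0, 1.0))
    for i in range(n):
        for j in range(n):
            reg = q * delta[i] + (1 - q) * delta[j]
            info = q * gain[i] + (1 - q) * gain[j]
            # Small floor guards against division by a zero information gain.
            ratio = reg**2 / np.maximum(info, 1e-12)
            k = int(np.argmin(ratio))
            if ratio[k] < best[0]:
                best = (ratio[k], (i, j, q[k]))
    _, (i, j, qq) = best
    pi = np.zeros(n)
    pi[i] += qq
    pi[j] += 1 - qq
    return pi
```

Squaring the regret in the numerator is what allows the algorithm to prefer a slightly more costly action when it is far more informative, the behavior the paper's analytic examples highlight.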
In the linear programming approach to approximate dynamic programming, one tries to solve a certain linear program, the ALP, that has a relatively small number K of variables but an intractable number M of constraints. In this paper, we study a scheme that samples and imposes a subset of m ≪ M constraints. A natural question that arises in this context is: how must m scale with respect to K and M in order to ensure that the resulting approximation is almost as good as one given by exact solution of the ALP? We show that, given an idealized sampling distribution and appropriate constraints on the K variables, m can be chosen independently of M and need grow only as a polynomial in K. We interpret this result in a context involving controlled queueing networks.

1. Introduction. Due to the "curse of dimensionality," Markov decision processes typically have a prohibitively large number of states, rendering exact dynamic programming methods intractable and calling for the development of approximation techniques. This paper represents a step in the development of a linear programming approach to approximate dynamic programming (de Farias and Van Roy 2003; Schweitzer and Seidmann 1985; Trick and Zin 1993, 1997). This approach relies on solving a linear program that generally has few variables but an intractable number of constraints. In this paper, we propose and analyze a constraint sampling method for approximating the solution to this linear program. We begin in this section by discussing our working problem formulation, the linear programming approach, constraint sampling, results of our analysis, and related literature.
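A hedged sketch of the reduced LP follows, mirroring the array conventions of the ALP sketch above: sample m state-action constraints i.i.d. from a distribution `psi` and solve the smaller program. The box bound `R` is a crude stand-in for the bounded feasible set that the paper's analysis requires; `psi` here is simply an input, whereas the analysis studies an idealized choice of it.

```python
import numpy as np
from scipy.optimize import linprog

def sampled_alp(g, P, Phi, c, psi, m, alpha=0.95, R=100.0, seed=0):
    """Solve a reduced ALP using m constraints sampled i.i.d. from psi.

    g: (A, S) costs; P: (A, S, S) transitions; Phi: (S, K) basis;
    c: (S,) state-relevance weights; psi: (A*S,) sampling distribution
    over state-action constraints.
    """
    A, S = g.shape
    rng = np.random.default_rng(seed)
    idx = rng.choice(A * S, size=m, replace=True, p=psi)
    rows, rhs = [], []
    for k in idx:
        a, x = divmod(k, S)
        # Sampled constraint: (Phi r)(x) <= g(x, a) + alpha * (P_a Phi r)(x).
        rows.append(Phi[x] - alpha * P[a, x] @ Phi)
        rhs.append(g[a, x])
    # Box bounds keep the reduced LP bounded (stand-in for the set N).
    res = linprog(-(Phi.T @ c), A_ub=np.asarray(rows), b_ub=np.asarray(rhs),
                  bounds=[(-R, R)] * Phi.shape[1])
    return res.x
```

The point of the paper's result is visible in the signature: m is a parameter chosen as a polynomial in K, with no dependence on the number M = A * S of constraints.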
We propose an approximation method for analyzing Ericson and Pakes (1995)-style dynamic models of imperfect competition. We develop a simple algorithm for computing an "oblivious equilibrium," in which each firm is assumed to make decisions based only on its own state and knowledge of the long run average industry state, but where firms ignore current information about competitors' states. We prove that, as the market becomes large, if the equilibrium distribution of firm states obeys a certain "light-tail" condition, then oblivious equilibria closely approximate Markov perfect equilibria. We develop bounds that can be computed to assess the accuracy of the approximation for any given applied problem. Through computational experiments, we find that the method often generates useful approximations for industries with hundreds of firms and in some cases even tens of firms.
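A highly stylized sketch of the fixed-point computation: alternate between a single-firm best response to a fixed long-run average industry state and the invariant firm-state distribution that the best response induces. Entry, exit, and the full industry-state dynamics of the paper's model are omitted, and all interfaces here are assumptions made for illustration.

```python
import numpy as np

def oblivious_equilibrium(profit, T, beta=0.95, n_iter=200, vi_iters=500, damp=0.5):
    """Stylized fixed-point iteration for an oblivious equilibrium (sketch).

    Each firm optimizes against a fixed long-run average industry state
    s_bar, ignoring competitors' current states.

    profit: (x, s_bar) -> per-period profit for a firm in own state x.
    T:      (A, S, S) own-state transition matrices, one per action.
    """
    A, S, _ = T.shape
    s_bar = (S - 1) / 2.0  # initial guess for the average industry state
    for _ in range(n_iter):
        # 1. Best response: value iteration for the single-firm problem.
        V = np.zeros(S)
        for _ in range(vi_iters):
            Q = np.array([[profit(x, s_bar) + beta * T[a, x] @ V
                           for x in range(S)] for a in range(A)])
            V = Q.max(axis=0)
        policy = Q.argmax(axis=0)
        # 2. Invariant distribution of firm states under the oblivious policy.
        P_pi = np.array([T[policy[x], x] for x in range(S)])
        evals, evecs = np.linalg.eig(P_pi.T)
        mu = np.real(evecs[:, np.argmax(np.real(evals))])
        mu = np.abs(mu) / np.abs(mu).sum()
        # 3. Damped update of the long-run average industry state.
        s_bar = (1 - damp) * s_bar + damp * (mu @ np.arange(S))
    return s_bar, policy, mu
```

The key simplification of oblivious equilibrium is visible here: each firm's problem is a single-agent dynamic program over its own S states, rather than a game over the exponentially large joint industry state.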