The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies; such experiments have until now been lacking in the literature for the Bernoulli case.
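To make the algorithm under discussion concrete, the following is a minimal sketch of Thompson Sampling for Bernoulli rewards with Beta(1, 1) priors. The arm means, horizon, and regret bookkeeping are illustrative choices for a toy simulation, not details taken from the paper.

```python
import numpy as np

def thompson_sampling_bernoulli(arm_means, horizon, rng=None):
    """Thompson Sampling with Beta(1, 1) priors on Bernoulli arms.

    arm_means is only used here to simulate rewards and measure regret;
    the algorithm itself sees rewards alone.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = len(arm_means)
    successes = np.zeros(k)          # number of 1-rewards per arm
    failures = np.zeros(k)           # number of 0-rewards per arm
    best_mean = max(arm_means)
    regret = 0.0
    for _ in range(horizon):
        # Draw one sample per arm from its Beta posterior and play the argmax.
        samples = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(samples))
        reward = rng.binomial(1, arm_means[arm])
        successes[arm] += reward
        failures[arm] += 1 - reward
        regret += best_mean - arm_means[arm]
    return regret

# Toy run: cumulative regret over 10,000 rounds on a 3-armed instance.
print(thompson_sampling_bernoulli([0.1, 0.5, 0.6], 10_000))
```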
We propose a stochastic approximation (SA) based method with randomization of samples for policy evaluation using the least squares temporal difference (LSTD) algorithm. Our proposed scheme is equivalent to running regular temporal difference learning with linear function approximation, albeit with samples picked uniformly from a given dataset. Our method results in an O(d) improvement in complexity in comparison to LSTD, where d is the dimension of the data. We provide non-asymptotic bounds for our proposed method, both in high probability and in expectation, under the assumption that the matrix underlying the LSTD solution is positive definite. The latter assumption is easily satisfied for the pathwise LSTD variant proposed by Lazaric (J Mach Learn Res 13:3041-3074, 2012). Moreover, we also establish that using our method in place of LSTD does not impact the rate of convergence of the approximate value function to the true value function. These rate results, coupled with the low computational complexity of our method, make it attractive for implementation in big data settings, where d is large. A similar low-complexity alternative for least squares regression is well known: the stochastic gradient descent (SGD) algorithm, for which we also provide finite-time bounds. We empirically demonstrate the practicality of our method as an efficient alternative to pathwise LSTD by combining it with the least squares policy iteration algorithm in a traffic signal control application. We also conduct another set of experiments that combines the SA-based low-complexity variant for least squares regression with the LinUCB algorithm for contextual bandits, using the large-scale news recommendation dataset from Yahoo.
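Since the abstract describes the scheme as regular TD with linear function approximation run on samples drawn uniformly from a fixed dataset, the sketch below illustrates that idea. The data format (phi_s, r, phi_next), the step-size constant c, and the iterate averaging are illustrative assumptions and need not match the paper's exact algorithm; each iteration costs O(d).

```python
import numpy as np

def sa_lstd(transitions, gamma=0.95, n_iters=50_000, c=1.0, rng=None):
    """Stochastic-approximation alternative to LSTD (a sketch).

    transitions: list of (phi_s, r, phi_next), where phi_s and phi_next are
    d-dimensional feature vectors of the current and next state and r is the
    observed reward. Each step draws one transition uniformly at random and
    applies a TD(0)-style update, so the per-iteration cost is O(d).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = len(transitions[0][0])
    theta = np.zeros(d)
    theta_avg = np.zeros(d)
    for n in range(1, n_iters + 1):
        phi_s, r, phi_next = transitions[rng.integers(len(transitions))]
        td_error = r + gamma * phi_next.dot(theta) - phi_s.dot(theta)
        theta = theta + (c / n) * td_error * phi_s   # illustrative step size c/n
        theta_avg += (theta - theta_avg) / n         # running average of iterates
    return theta_avg
```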
We propose a stochastic approximation based method with randomisation of samples for policy evaluation using the least squares temporal difference (LSTD) algorithm. Our method results in an O(d) improvement in complexity in comparison to regular LSTD, where d is the dimension of the data. We provide convergence rate results for our proposed method, both in high probability and in expectation. Moreover, we also establish that using our scheme in place of LSTD does not impact the rate of convergence of the approximate value function to the true value function. This result, coupled with the low complexity of our method, makes it attractive for implementation in big data settings, where d is large. Further, we also analyse a similar low-complexity alternative for least squares regression and provide finite-time bounds there. We demonstrate the practicality of our method for LSTD empirically by combining it with the LSPI algorithm in a traffic signal control application.

Several machine learning problems involve solving a linear system of equations from a given set of training data. In this paper we consider the problem of policy evaluation in reinforcement learning (RL) using the method of temporal differences (TD). Given a fixed training data set, one popular temporal difference algorithm for policy evaluation is LSTD (Bradtke and Barto [1996]). However, LSTD is computationally expensive as it requires O(d^2) computations. We propose a stochastic approximation (SA) based algorithm that draws data samples from a uniform distribution on the training set. From the finite-time analyses that we provide, we observe that our algorithm converges at the optimal rate, in high probability as well as in expectation. Moreover, using our scheme in place of LSTD does not impact the rate of convergence of the approximate value function to the true value function. This finding, coupled with the significant decrease in the computational cost of our algorithm, makes it appealing in the canonical big data settings.

The problem considered here is to estimate the value function V^π of a given policy π. Temporal difference (TD) methods are well known in this context, and they are known to converge to the fixed point of the Bellman operator T^π (see Section 2.1 for a precise definition). A popular approach to overcome the curse of dimensionality associated with large state spaces is to parameterize the value function using a linear function approximation architecture: for every s in the state space S, we approximate V^π(s) ≈ θ^T φ(s), where φ(·) is a d-dimensional feature vector with d ≪ |S|, and θ is a tunable parameter. The function approximation variant of TD (Tsitsiklis and Van Roy [1997]) is known to converge to the fixed point of Φθ = ΠT^π(Φθ), where Π is the orthogonal projection onto the space within which we approximate the value function, and Φ is the feature matrix that characterises this space.

LSTD estimates the fixed point of ΠT^π using empirical data D := {(s_i, r_i, s_i'), i = 1, …, T} obtained by simulating t...
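For contrast with the O(d)-per-iteration scheme, here is a sketch of the batch LSTD solve referred to above: it forms the d×d matrix A and vector b from the dataset D and solves Aθ = b, which is where the O(d^2) (and O(d^3) for the solve) cost comes from. The small ridge term and the data format are illustrative implementation choices, not part of LSTD itself.

```python
import numpy as np

def lstd(transitions, gamma=0.95, ridge=1e-6):
    """Batch LSTD (Bradtke and Barto, 1996), sketched for a fixed dataset.

    transitions: list of (phi_s, r, phi_next) with d-dimensional features.
    Builds A = sum_i phi_i (phi_i - gamma * phi_i')^T and b = sum_i r_i phi_i,
    then solves A theta = b; theta approximates the fixed point of
    Phi theta = Pi T^pi (Phi theta).
    """
    d = len(transitions[0][0])
    A = np.zeros((d, d))
    b = np.zeros(d)
    for phi_s, r, phi_next in transitions:
        A += np.outer(phi_s, phi_s - gamma * phi_next)
        b += r * phi_s
    # The ridge term is only for numerical stability in this toy sketch.
    return np.linalg.solve(A + ridge * np.eye(d), b)
```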
Online learning algorithms often need to recompute least squares regression estimates of parameters. We study improving the computational complexity of such algorithms by using stochastic gradient descent (SGD) type schemes in place of classic regression solvers. We show that SGD schemes efficiently track the true solutions of the regression problems, even in the presence of a drift. This finding, coupled with an O(d) improvement in complexity, where d is the dimension of the data, makes them attractive for implementation in big data settings. When strong convexity of the regression problem is guaranteed, we provide bounds on the error both in expectation and in high probability (the latter is often needed to provide theoretical guarantees for higher-level algorithms), despite the drifting least squares solution. As an example of this case, we prove that the regret of an SGD version of the PEGE linear bandit algorithm is worse than that of PEGE itself only by a factor of O(log^4 n). When strong convexity of the regression problem cannot be guaranteed, we investigate using an adaptive regularisation. We conduct an empirical study of an adaptively regularised SGD version of LinUCB in a news article recommendation application, using the large-scale news recommendation dataset from the Yahoo! front page. These experiments show a large gain in computational complexity and a consistently low tracking error.
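To make the idea of tracking a drifting least squares solution with cheap updates concrete, the sketch below applies one O(d) SGD step per observation to the squared loss and follows a slowly moving target parameter. The step-size schedule, drift magnitude, and noise level are toy assumptions for illustration; they are not the schemes or constants analysed in the paper.

```python
import numpy as np

def sgd_step(theta, x, y, step_size):
    """One SGD step on the squared loss 0.5 * (x.theta - y)^2; cost O(d)."""
    return theta - step_size * (x.dot(theta) - y) * x

# Toy tracking experiment: the regression target drifts slowly over time.
rng = np.random.default_rng(0)
d = 10
theta_star = np.zeros(d)   # drifting "true" parameter
theta = np.zeros(d)        # SGD iterate
for t in range(1, 5001):
    theta_star = theta_star + 0.001 * rng.standard_normal(d)  # slow drift
    x = rng.standard_normal(d)
    y = x.dot(theta_star) + 0.1 * rng.standard_normal()
    theta = sgd_step(theta, x, y, step_size=0.5 / np.sqrt(t))
print("tracking error:", np.linalg.norm(theta - theta_star))
```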