Consider the problem of sequential sampling from m statistical populations to maximize the expected sum of outcomes in the long run. Under suitable assumptions on the unknown parameters θ ∈ Θ, it is shown that there exists a class C_R of adaptive policies with the following properties: (i) the expected n-horizon reward under any policy in C_R grows at the best achievable asymptotic rate, and (ii) policies in C_R are asymptotically optimal within the larger class C_UF of uniformly fast convergent policies. Policies in C_R are specified via easily computable indices, defined as unique solutions to dual problems that arise naturally from the functional form of M. In addition, the assumptions are verified for populations specified by nonparametric discrete univariate distributions with finite support. In the case of normal populations with unknown means and variances, we leave as an open problem the verification of one assumption.
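As a hedged sketch of the kind of easily computable index such policies use (the notation below is ours; the paper's exact dual formulation is not reproduced here), one standard form is an upper confidence value obtained by inflating the empirical estimate of population j within a Kullback–Leibler neighborhood:

$$ u_j(n) \;=\; \sup\Big\{ \mu(\theta) \;:\; \theta \in \Theta_j,\; I\big(\hat\theta_j(n), \theta\big) \le \frac{\log n}{T_j(n)} \Big\}, $$

where \(\hat\theta_j(n)\) is the empirical estimate after \(T_j(n)\) samples from population j, \(I\) is a Kullback–Leibler divergence, and \(\mu(\theta)\) is the population mean; at each step one samples a population with the largest current index.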
The multi-armed bandit problem arises in sequentially allocating effort to one of N projects and sequentially assigning patients to one of N treatments in clinical trials. Gittins and Jones (1974) have shown that one optimal policy for the N-project problem, an N-dimensional discounted Markov decision chain, is determined by the following largest-index rule. There is an index for each state of each given project that depends only on the data of that project. In each period one allocates effort to a project with largest current index. The purpose of this paper is to give a short proof of this result and a new characterization of the index of a project in state i, viz., as the maximum expected present value in state i for the restart-in-i problem in which, in each state and period, one either continues allocating effort to the project or immediately restarts the project in state i. Moreover, it is shown that an approximate largest-index rule yields an approximately optimal policy, and that the indices can be computed by exploiting sparse transition matrices in larger state spaces than have been suggested heretofore. By using a suitable implementation of successive approximations, a policy whose expected present value is within 100ε% of the maximum possible range of values of the indices can be found on-line with at most (N + T − 1)TM operations, where M is the number of operations required to calculate one approximation, T is the least integer majorizing the ratio ln ε / ln α, and 0 < α < 1 is the discount factor.
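As a concrete illustration of the restart-in-i characterization, the following Python sketch computes, by successive approximations (value iteration), the maximum expected present value in state i of the restart-in-i problem for a single project. The function name, the toy transition matrix, the rewards, and the tolerance are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def restart_in_i_value(P, r, alpha, i, tol=1e-10, max_iter=100_000):
    """Value iteration for the restart-in-i problem described above.

    P     : (n, n) transition matrix of the single project (a Markov chain)
    r     : (n,) one-period rewards
    alpha : discount factor, 0 < alpha < 1
    i     : the state whose index we want

    In every state j one either continues (reward r[j], transition per P[j])
    or immediately restarts the project in state i (reward r[i], transition per P[i]).
    The returned V[i] is the maximum expected present value in state i,
    i.e. the index of state i in the characterization above.
    """
    n = len(r)
    V = np.zeros(n)
    for _ in range(max_iter):
        continue_val = r + alpha * P @ V        # keep working on the project
        restart_val = r[i] + alpha * P[i] @ V   # restart the project in state i
        V_new = np.maximum(continue_val, restart_val)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new[i]
        V = V_new
    return V[i]

# Toy 3-state project (illustrative numbers only).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])
r = np.array([1.0, 0.5, 0.1])
alpha = 0.9
indices = [restart_in_i_value(P, r, alpha, i) for i in range(3)]
print(indices)  # largest-index rule: allocate effort to a project whose current state has the largest index
```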
In this paper we consider the problem of adaptive control for Markov decision processes. We give the explicit form of a class of adaptive policies that possess optimal rate-of-increase properties for the total expected finite-horizon reward, under the assumptions of finite state-action spaces and irreducibility of the transition law. A main feature of the proposed policies is that the choice of actions, at each state and time period, is based on indices that are inflations of the right-hand side of the estimated average-reward optimality equations.
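As a hedged sketch (our notation, not the paper's), an index of this type for state-action pair (x, a) at time t can be written as the estimated right-hand side of the average-reward optimality equation plus a nonnegative inflation term u_t(x, a):

$$ L_t(x,a) \;=\; \hat r_t(x,a) \;+\; \sum_{y} \hat p_t(y \mid x, a)\, \hat h_t(y) \;+\; u_t(x,a), $$

where \(\hat r_t\) and \(\hat p_t\) are empirical reward and transition estimates, \(\hat h_t\) is the relative value (bias) function solving the estimated average-reward optimality equations, and in each state the policy chooses an action with the largest index \(L_t(x,a)\).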
We consider the problem of sampling sequentially from two or more populations in such a way as to maximize the expected sum of outcomes in the long run.
A class of Markov chains we call successively lumpable is specified for which it is shown that the stationary probabilities can be obtained by successively computing the stationary probabilities of a propitiously constructed sequence of Markov chains. Each of the latter chains has a (typically much) smaller state space, and this yields significant computational improvements. We discuss how the results for discrete-time Markov chains extend to semi-Markov processes and continuous-time Markov processes. Finally, we study applications of successively lumpable Markov chains to classical reliability and queueing models.
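The per-chain step in such a scheme is an ordinary stationary-distribution computation. The following minimal Python sketch shows only that building block, solving πP = π with Σπ = 1 for a small discrete-time chain; it is not the successive lumping construction itself, and the function name and example matrix are ours.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary probabilities pi of a finite ergodic Markov chain with
    transition matrix P, obtained by solving pi P = pi, sum(pi) = 1.
    This is the computation repeated on each of the smaller chains in the
    successive construction; the construction itself is not reproduced here."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])  # stationarity plus normalization
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P = np.array([[0.9, 0.1, 0.0],
              [0.3, 0.4, 0.3],
              [0.0, 0.2, 0.8]])
print(stationary_distribution(P))
```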
This paper defines and studies the down entrance state (DES) and the restart entrance state (RES) classes of quasi-skip-free (QSF) processes, specified in terms of the nonzero structure of the elements of their transition rate matrix Q. A QSF process is a Markov chain whose states can be specified by tuples of the form (m, i), where m ∈ Z represents the "current" level of the state and i ∈ Z_+ its current phase, and whose transition rate matrix Q does not permit one-step transitions to states that are two or more levels away from the current state in one direction of the level variable m. A QSF process is a DES process if and only if one-step "down" transitions from a level m can only reach a single state in level m − 1, for all m. A QSF process is a RES process if and only if one-step "up" transitions from a level m can only reach a single set of states in the highest level M (the largest of all levels m). We derive explicit solutions and simple truncation bounds for the steady-state probabilities of both DES and RES processes when, in addition, Q ensures ergodicity. DES and RES processes have applications in many areas of applied probability, including computer science, queueing theory, inventory theory, reliability, and the theory of branching processes. To motivate their applicability, we present explicit solutions for the well-known open problem of the M/Er/n queue with batch arrivals, an inventory model, and a reliability model.
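To make the DES condition concrete, here is a minimal Python sketch (interface and names are ours, not from the paper) that checks, for a finite truncation of a QSF process with a given level labeling, whether all one-step "down" transitions out of each level enter a single state of the level below.

```python
import numpy as np

def is_des(Q, levels, tol=1e-12):
    """Check the DES property stated above: for every level m, all one-step
    'down' transitions out of level m enter a single state of level m - 1.

    Q      : (n, n) transition rate matrix (off-diagonal entries >= 0)
    levels : length-n sequence giving the integer level of each state
    """
    n = Q.shape[0]
    for m in sorted(set(levels)):
        targets = set()
        for s in range(n):
            if levels[s] != m:
                continue
            for t in range(n):
                if s != t and Q[s, t] > tol and levels[t] == m - 1:
                    targets.add(t)
        if len(targets) > 1:
            return False  # down transitions from level m reach more than one state
    return True
```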