Appendix A provides additional background that describes the multi-armed bandit problem and the relationship of the simulation selection problem to a stoppable version of the multi-armed bandit. It also provides a numerical example that shows that the few existing results that characterize optimal policies for stoppable bandits do not apply to the simulation selection problem. Appendix B motivates the free boundary equation whose solution approximates the optimal expected discounted reward when k = 1. Appendix C provides mathematical proofs of the claims in the main paper. Appendix D describes several technical extensions that expand the range of validity of the paper; it relaxes some assumptions about the independence of the output from a single system, as well as the duration of the replications for each alternative. Appendix E summarizes how the optimal expected discounted reward (OEDR) and stopping boundaries for the simulation selection problem with k = 1 alternative were computed. Appendix F specifies the simulation selection procedures that are used in §6.3.
Appendix A: Supplement: Multi-Armed Bandits and the Simulation Selection Problem

The simulation selection problem is closely related to a class of sequential decision problems known as the multi-armed bandit problem. In this section, we review relevant theory, and we apply it to demonstrate that simulation selection problems can be reduced to a variation of the multi-armed bandit that is called a stoppable bandit problem. We then present a numerical example indicating that well-known sufficient conditions, used to justify the optimality of index-based rules in stoppable bandit problems, do not hold in our case.
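As a concrete illustration of the index-based rules discussed above (not part of this paper's development), the following sketch computes a Gittins index for a finite Markov reward chain using the restart-in-state formulation of Katehakis and Veinott (1987): the index of state m equals (1 − β) times the value of the problem in which, at every stage, the decision-maker may either continue from the current state or restart from state m. Function and variable names are illustrative.

```python
import numpy as np

def gittins_index(P, r, beta, state, tol=1e-10, max_iter=100_000):
    """Gittins index of `state` for a Markov chain with transition matrix P,
    one-period expected rewards r, and discount factor beta in (0, 1).

    Uses value iteration on the restart-in-state Bellman equation
        V(s) = max( r[s] + beta * P[s] @ V,  r[state] + beta * P[state] @ V ),
    after which the index equals (1 - beta) * V(state).
    """
    V = np.zeros(len(r))
    for _ in range(max_iter):
        continue_val = r + beta * (P @ V)          # keep playing from each state s
        restart_val = continue_val[state]          # or restart the chain at `state`
        V_new = np.maximum(continue_val, restart_val)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return (1.0 - beta) * V[state]

# Sanity check: a frozen chain (identity transitions) pays r[s] forever,
# so the Gittins index of each state is just its one-period reward.
idx = gittins_index(np.eye(2), np.array([1.0, 2.0]), beta=0.9, state=0)
```

In a k-armed bandit, one would compute the index of each arm's current state in this way and, as Gittins' theorem asserts, play the arm whose index is greatest.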
A.1. The Multi-Armed Bandit Problem

This section supplements the discussion in §3 by providing formal definitions of the multi-armed bandit problem and of optimal allocation index rules.

In the discounted multi-armed bandit problem, a decision-maker chooses repeatedly among a finite set of mutually independent Markov chains that are indexed i = 1, 2, . . . , k. A choice of chain i at stage t yields an expected reward that is specific to the state of chain i, and it initiates a state transition for chain i. The k − 1 chains not chosen at stage t remain in their current states and earn no rewards. The objective is to maximize the expected sum of discounted rewards over an infinite horizon (Gittins 1989). For the case in which expected one-period rewards are bounded for each chain, Gittins and co-workers proved that an index can be computed for each arm, independently of all other arms, such that it is optimal to select the arm whose index is greatest among all arms. This allocation index has come to be known as a "Gittins index."

Formally, we define the multi-armed bandit's parameters as follows. Markov chain i has state space Ω_{Θ_i}, with states Θ_i ∈ Ω_{Θ_i}. The state space has a σ-algebra, F_i, of subsets of Ω_{Θ_i}, which includes all elements Θ_i ∈ Ω_{Θ_i}. We define the product space of joint outcomes across all k Markov chains as (Ω, F). If cha...