We consider a broad class of stochastic dynamic programming problems that are amenable to relaxation via decomposition. These problems comprise multiple subproblems that are independent of each other except for a collection of coupling constraints on the action space. We fit an additively separable value function approximation using two techniques, namely, Lagrangian relaxation and the linear programming (LP) approach to approximate dynamic programming. We prove various results comparing the relaxations to each other and to the optimal problem value. We also provide a column generation algorithm for solving the LP-based relaxation to any desired optimality tolerance, and we report on numerical experiments on bandit-like problems. Our results provide insight into the complexity versus quality trade-off when choosing which of these relaxations to implement.
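As a minimal illustration of the Lagrangian relaxation described above (with hypothetical rewards and a single one-period coupling constraint, not taken from the paper), consider two subproblems coupled only by a shared action budget. Dualizing the coupling constraint decouples the subproblems, so the resulting bound is additively separable across them, and it upper-bounds the true optimum:

```python
# Minimal sketch: two subproblems choose actions a_i in {0, 1}, coupled by
# the budget constraint a_1 + a_2 <= B. Rewards are invented for illustration.
rewards = [{0: 0.0, 1: 5.0}, {0: 0.0, 1: 3.0}]  # r_i(a) per subproblem
B = 1  # budget on the total action

def exact_optimum():
    # Enumerate joint actions satisfying the coupling constraint.
    best = float("-inf")
    for a1 in (0, 1):
        for a2 in (0, 1):
            if a1 + a2 <= B:
                best = max(best, rewards[0][a1] + rewards[1][a2])
    return best

def lagrangian_bound(lam):
    # Dualize the coupling constraint with multiplier lam >= 0: the bound
    # decomposes into a sum of independent per-subproblem maximizations.
    return lam * B + sum(max(r[a] - lam * a for a in r) for r in rewards)

opt = exact_optimum()
# Minimize the bound over a grid of multipliers; any lam gives a valid bound.
bound = min(lagrangian_bound(l / 10) for l in range(0, 101))
assert bound >= opt - 1e-9  # the relaxation upper-bounds the true optimum
```

In this toy instance the minimized Lagrangian bound is tight; in general the abstract's point is precisely that the relaxations differ in tightness and cost.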
Inventory record inaccuracy is a significant problem for retailers using automated inventory management systems. In this paper, we consider an intelligent inventory management tool that accounts for record inaccuracy using a Bayesian belief of the physical inventory level. We assume that excess demands are lost and unobserved, in which case sales data reveal information about physical inventory levels. We show that a probability distribution on physical inventory levels is a sufficient summary of past sales and replenishment observations, and that this probability distribution can be efficiently updated in a Bayesian fashion as observations are accumulated. We also demonstrate the use of this distribution as the basis for practical replenishment and inventory audit policies and illustrate how the needed parameters can be estimated using data from a large national retailer. Our replenishment policies avoid the problem of "freezing," in which a physical inventory position persists at zero while the corresponding record is positive. In addition, simulation studies show that our replenishment policies recoup much of the cost of inventory record inaccuracy, and that our audit policy significantly outperforms the popular "zero balance walk" audit policy.

Keywords: retail execution, inventory control, record inaccuracy, inventory shrinkage, Bayes rule
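The Bayesian update of the inventory belief can be sketched as follows, assuming (for illustration only) Poisson demand and the lost-sales observation model in which recorded sales equal the minimum of demand and physical inventory. The `update_belief` helper and its parameters are hypothetical names, not the paper's:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

def poisson_sf(k, lam):
    # P(D >= k) for Poisson demand D
    return 1.0 - sum(poisson_pmf(j, lam) for j in range(k))

def update_belief(belief, sales, lam):
    """Bayesian update of P(physical inventory = x) after observing one
    period's sales, assuming sales = min(demand, inventory) and that
    excess demand is lost and unobserved."""
    posterior = {}
    for x, p in belief.items():
        if sales < x:
            like = poisson_pmf(sales, lam)  # demand exactly equals sales
        elif sales == x:
            like = poisson_sf(x, lam)       # stockout: demand >= inventory
        else:
            like = 0.0                      # cannot sell more than on hand
        posterior[x - sales] = posterior.get(x - sales, 0.0) + p * like
    z = sum(posterior.values())
    return {x: p / z for x, p in posterior.items()}

belief = {2: 1/3, 3: 1/3, 4: 1/3}           # prior belief on physical inventory
posterior = update_belief(belief, sales=2, lam=2.0)
```

Here the posterior puts the most mass on zero remaining inventory, because a stockout (demand of at least 2) is more likely under Poisson(2) demand than demand of exactly 2; this is how sales observations reveal information about physical inventory.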
We consider a multiarmed bandit problem where the expected reward of each arm is a linear function of an unknown scalar with a prior distribution. The objective is to choose a sequence of arms that maximizes the expected total (or discounted total) reward. We demonstrate the effectiveness of a greedy policy that takes advantage of the known statistical correlation structure among the arms. In the infinite horizon discounted reward setting, we show that the greedy and optimal policies eventually coincide, and both settle on the best arm. This is in contrast with the Incomplete Learning Theorem for the case of independent arms. In the total reward setting, we show that the cumulative Bayes risk after T periods under the greedy policy is at most O(log T), which is smaller than the lower bound of Ω(log² T) established by Lai [1] for a general, but different, class of bandit problems. We also establish the tightness of our bounds. Theoretical and numerical results show that the performance of our policy scales independently of the number of arms.

Index Terms: Markov decision process (MDP).

I. INTRODUCTION

In the multiarmed bandit problem, a decision-maker samples sequentially from a set of arms whose reward characteristics are unknown. The distribution of the reward of each arm is learned from accumulated experience as the decision-maker seeks to maximize the expected total (or discounted total) reward over a horizon. The problem has garnered significant attention as a prototypical example of the so-called exploration versus exploitation dilemma, in which a decision-maker balances the incentive to exploit the arm with the highest expected payoff against the incentive to explore poorly understood arms for information-gathering purposes.
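A minimal sketch of a greedy policy exploiting linear correlation among arms, under an assumed Normal prior on the unknown scalar and Normal reward noise (the arm parameters below are invented for illustration). Every pull, whichever arm is chosen, informs the common scalar, which is why greedy exploration-free play can still settle on the best arm here:

```python
import random

# Arm i has expected reward u[i] + v[i] * Z for an unknown scalar Z.
u = [0.0, 1.0, -0.5]
v = [1.0, -1.0, 2.0]
Z_true = 0.7          # unknown to the policy; used only to simulate rewards
noise_sd = 0.5

mu, tau2 = 0.0, 1.0   # Normal prior on Z: mean and variance

def greedy_arm():
    # Exploit: pick the arm with the highest posterior-mean reward.
    return max(range(len(u)), key=lambda i: u[i] + v[i] * mu)

random.seed(0)
for t in range(200):
    i = greedy_arm()
    r = u[i] + v[i] * Z_true + random.gauss(0.0, noise_sd)
    # Conjugate Normal update of the belief on Z from the linear
    # observation r = u[i] + v[i] * Z + noise.
    prec = 1.0 / tau2 + v[i] ** 2 / noise_sd ** 2
    mu = (mu / tau2 + v[i] * (r - u[i]) / noise_sd ** 2) / prec
    tau2 = 1.0 / prec
```

Note the contrast with independent arms: even when the greedy policy starts on a suboptimal arm, that arm's rewards shrink the posterior on Z, steering the policy toward the true best arm rather than freezing on a bad one.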
A growing segment of the revenue management and pricing literature assumes "strategic" customers who are forward-looking in their pursuit of utility. Recognizing that such behavior may not be directly observable by a seller, we examine the implications of seller uncertainty over strategic customer behavior in a markdown pricing setting. We assume that some proportion of customers purchase impulsively in the first period if the price is below their willingness to pay, while other customers strategically wait for lower prices in the second period. We consider a two-period selling season in which the seller knows the aggregate demand curve but not the proportion of customers behaving strategically. We show that a robust pricing policy that requires no knowledge of the extent of strategic behavior performs remarkably well. We extend our model to a setting with stochastic demand, and show that the robust pricing policy continues to perform well, particularly as capacity is loosened or the problem is scaled up. Our results underscore the need to recognize strategic behavior, but also suggest that in many cases effective performance is possible without precise knowledge of strategic behavior.
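The role of the strategic proportion can be illustrated with a toy two-period markdown model (uniform willingness to pay on [0, 1], a unit mass of customers; this is not the paper's demand model, and the function below is a hypothetical construction). Impulsive customers buy in period 1 if the price is below their willingness to pay; strategic customers wait for the markdown:

```python
def revenue(p1, p2, alpha):
    """Seller revenue in a toy two-period markdown with p2 <= p1.
    alpha is the proportion of strategic customers; willingness to pay
    is Uniform(0, 1)."""
    impulsive = (1 - alpha) * (p1 * (1 - p1)            # buy at p1 in period 1
                               + p2 * max(p1 - p2, 0))  # remainder buy at p2
    strategic = alpha * p2 * (1 - p2)                   # wait for the markdown
    return impulsive + strategic
```

Under a fixed markdown, revenue declines as alpha grows, since more customers wait for the lower price; this is the dependence on strategic behavior that the seller faces without knowing alpha, and that the robust policy must hedge against.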
When a marketer in an interactive environment decides which messages to send to her customers, she may send messages currently thought to be most promising (exploitation) or use poorly understood messages for the purpose of information gathering (exploration). We assume that customers are already clustered into homogeneous segments, and we consider the adaptive learning of message effectiveness within a customer segment. We present a Bayesian formulation of the problem in which decisions are made for batches of customers simultaneously, although decisions may vary within a batch. This extends the classical multiarmed bandit problem, in which samples are drawn one at a time from a set of reward populations. Our solution methods include a Lagrangian decomposition-based approximate dynamic programming approach and a heuristic based on a known asymptotic approximation to the multiarmed bandit solution. Computational results show that our methods clearly outperform approaches that ignore the effects of information gain.
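A generic batch-learning sketch of this setting, using Thompson sampling with Beta-Bernoulli beliefs (a standard heuristic, not the paper's Lagrangian-decomposition method, and with hypothetical response rates): message decisions vary within a batch because each customer draws its own posterior sample, and beliefs update between batches:

```python
import random

random.seed(1)
true_rates = [0.05, 0.12, 0.08]  # hypothetical per-message response rates
alpha = [1, 1, 1]                # Beta posterior parameters per message
beta = [1, 1, 1]
BATCH = 100

for period in range(50):
    # Assign a message to each customer in the batch; assignments may
    # differ within the batch.
    counts = [0, 0, 0]
    for _ in range(BATCH):
        samples = [random.betavariate(alpha[i], beta[i]) for i in range(3)]
        counts[max(range(3), key=lambda i: samples[i])] += 1
    # Observe the batch's responses and update each message's belief.
    for i in range(3):
        wins = sum(random.random() < true_rates[i] for _ in range(counts[i]))
        alpha[i] += wins
        beta[i] += counts[i] - wins
```

Each customer contributes exactly one Bernoulli observation per period, so the beliefs accumulate BATCH observations per batch; a policy that ignored information gain would instead commit the whole batch to the current posterior-mean leader.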