Markov decision processes (MDPs) are a popular model for the performance analysis and optimization of stochastic systems. The parameters describing the stochastic behavior of an MDP are estimated from empirical observations of a system; their values are not known precisely. Different types of MDPs with uncertain, imprecise, or bounded transition rates or probabilities and rewards exist in the literature. Commonly, the analysis of models with uncertainties amounts to searching for the most robust policy, i.e., a policy with the greatest lower bound on performance (or, symmetrically, the lowest upper bound on costs). However, hedging against an unlikely worst case may lead to losses in other situations. In general, one is interested in policies that behave well in all situations, which results in a multi-objective view on decision making. In this paper, we consider policies for the expected discounted reward measure of MDPs with uncertain parameters. In particular, the approach is defined for bounded-parameter MDPs (BMDPs) [8]. In this setting, the worst-, best-, and average-case performances of a policy are analyzed simultaneously, which yields a multi-scenario multi-objective optimization problem. The paper presents and evaluates approaches to compute the pure Pareto-optimal policies in the value vector space.
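To make the worst-case (robust) scenario concrete, the following sketch shows pessimistic value iteration for a small BMDP, where each transition probability lies in an interval. The worst-case distribution for a fixed value vector can be found greedily by shifting the free probability mass toward low-value successors. This is an illustrative implementation under assumed array shapes, not the algorithm evaluated in the paper; the best-case variant is symmetric (sort descending instead).

```python
import numpy as np

def worst_case_dist(lower, upper, values):
    """Choose a distribution p with lower <= p <= upper, sum(p) = 1,
    that minimizes p @ values (greedy mass assignment)."""
    p = lower.copy()
    remaining = 1.0 - p.sum()
    for i in np.argsort(values):          # fill low-value successors first
        add = min(upper[i] - p[i], remaining)
        p[i] += add
        remaining -= add
    return p

def pessimistic_value_iteration(lower, upper, rewards, gamma=0.9, iters=200):
    """Lower-bound value iteration for a BMDP.
    lower/upper: (S, A, S) interval bounds on transition probabilities.
    rewards:     (S, A) expected one-step rewards."""
    S, A, _ = lower.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                p = worst_case_dist(lower[s, a], upper[s, a], V)
                Q[s, a] = rewards[s, a] + gamma * p @ V
        V = Q.max(axis=1)                 # best action under worst transitions
    return V
```

Averaging the lower- and upper-bound iterations (or fixing a nominal distribution inside the intervals) gives the average-case scenario, so the three value vectors of a policy can be compared componentwise when searching for Pareto-optimal policies.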
CCS Concepts: • Theory of computation → Theory and algorithms for application domains; • Applied computing → Operations research.
We consider Markov decision processes with uncertain transition probabilities and two optimization problems in this context: the finite horizon problem, which asks for an optimal policy over a finite number of transitions, and the percentile optimization problem for a wide class of uncertain Markov decision processes, which asks for a policy with the optimal probability of reaching a given reward objective. To the best of our knowledge, unlike other optimality criteria, the finite horizon problem has not been considered for bounded-parameter Markov decision processes, and the percentile optimization problem has only been considered for very special cases. Unlike most problems in Markov decision process research, dynamic programming is not applicable here, as the usual subdivision into independent subproblems in each state is no longer possible. Motivated by this observation, we establish NP-hardness results for these problems by exhibiting appropriate reductions.
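For contrast, the classical finite horizon problem on a *precise* MDP is solved by backward induction, exactly the per-state subdivision that the abstract notes breaks down under parameter uncertainty. A minimal sketch, assuming transition tensor `P` of shape `(S, A, S)` and reward matrix `R` of shape `(S, A)`:

```python
import numpy as np

def finite_horizon_values(P, R, T):
    """Backward induction for a precise MDP over horizon T.
    Returns the optimal T-step values and the time-dependent policy."""
    S, A, _ = P.shape
    V = np.zeros(S)              # value with 0 steps to go
    policy = []
    for _ in range(T):
        Q = R + P @ V            # (S, A): one-step reward + expected future value
        policy.append(Q.argmax(axis=1))
        V = Q.max(axis=1)
    policy.reverse()             # policy[t] is the decision rule at step t
    return V, policy
```

With interval-valued transition probabilities, the adversarial choice of parameters couples the subproblems across states, so this decomposition no longer yields the optimum; this is the structural obstacle behind the NP-hardness reductions.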