2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
DOI: 10.1109/allerton.2017.8262843

Transition-based versus state-based reward functions for MDPs with Value-at-Risk

Abstract: In reinforcement learning, a reward function defined on the current state and action is widely used. When the objective concerns only the expectation of the (discounted) total reward, this works perfectly. However, if the objective involves the distribution of the total reward, the result will be wrong. This paper studies Value-at-Risk (VaR) problems in short- and long-horizon Markov decision processes (MDPs) with two reward functions which share the same expectations. Firstly we show that with the VaR objective, when the real reward…
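To make the abstract's point concrete, here is a minimal sketch that is not taken from the paper: the two-outcome process, its probabilities, and the rewards are hypothetical. It compares a transition-based reward with the state-based reward obtained by taking its expectation; both give the same expected total reward, but their return distributions, and hence their VaR, differ.

```python
import numpy as np

# Hypothetical one-step Markov reward process (illustrative only).
# From state s0 the process moves to s1 with probability 0.5 (reward 10)
# or to s2 with probability 0.5 (reward 0).
rng = np.random.default_rng(0)
n = 100_000

# Transition-based reward: the realised reward depends on the next state.
next_state_is_s1 = rng.random(n) < 0.5
returns_transition = np.where(next_state_is_s1, 10.0, 0.0)

# State-based reward: replace the random reward by its expectation at s0.
returns_state = np.full(n, 5.0)

alpha = 0.4
for name, ret in [("transition-based", returns_transition),
                  ("state-based", returns_state)]:
    # VaR_alpha is taken here as the alpha-quantile of the total reward.
    print(f"{name}: mean = {ret.mean():.2f}, "
          f"VaR_{alpha} = {np.quantile(ret, alpha):.2f}")

# Both reward models give the same expected total reward (5.0),
# but their return distributions, and therefore their VaR, differ.
```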

Cited by 2 publications (7 citation statements). References 26 publications.
“…Since many RL methods require the reward function to be deterministic and state-based, the transformation is needed for the MDPs with other types of reward functions in the risk-sensitive problems. We generalize the transformation (Ma and Yu 2017) in different settings, and consider VaR as an example to show the effect of reward simplification on distribution.…”
Section: Results
confidence: 99%
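One standard way to meet the "deterministic and state-based" requirement mentioned in this quotation is to augment the state so that the reward becomes a deterministic function of the augmented state. The sketch below illustrates that idea for a transition-based reward r(s, a, s'); it is offered as an assumption-laden illustration, not necessarily the exact transformation of Ma and Yu (2017).

```python
from typing import Callable, Hashable, Tuple

State = Hashable
Action = Hashable
AugState = Tuple[State, Action, State]   # (previous state, action, current state)

def make_state_based_reward(
    r_transition: Callable[[State, Action, State], float]
) -> Callable[[AugState], float]:
    """Wrap a transition-based reward r(s, a, s') as a reward that depends
    only on an augmented state (s, a, s').  This is one common construction
    for obtaining a deterministic, state-based reward; it is a sketch, not
    necessarily the construction used in the cited work."""
    def r_state(x: AugState) -> float:
        s_prev, a_prev, s_cur = x
        return r_transition(s_prev, a_prev, s_cur)
    return r_state

# Usage: the augmented MDP tracks (s, a, s') as its state, so the realised
# reward is recovered without changing the distribution of the total reward.
r = make_state_based_reward(lambda s, a, s2: 10.0 if s2 == "s1" else 0.0)
print(r(("s0", "go", "s1")))   # -> 10.0
```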
“…In RL, when the expected return is considered and the Q-function or the value function is accessed, such a reward simplification is implied. The effect of the reward simplification on the return distribution in a finite-horizon Markov reward process has been studied in (Ma and Yu 2017). Here we estimate the distribution assuming it is approximately normal, illustrate the similar effect on the return distribution, and generalize the transformation for more practical cases.…”
Section: Markov Decision Processes
confidence: 96%
“…In many cases, conditional VaR (also known as expected shortfall) is preferred over VaR since it is coherent [19], i.e., it has some intuitively reasonable properties (convexity, for example). However, when the return can be assumed to be approximately normally distributed, VaR can be simply estimated with E(Φ) and V(Φ) [20].…”
Section: Quantile-based Risk
confidence: 99%
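As a sketch of the normal-approximation estimate mentioned in the quotation above: if the return Φ is treated as Gaussian, its α-quantile follows directly from E(Φ) and V(Φ). The function name, the numbers, and the use of scipy below are illustrative assumptions, not part of the cited works.

```python
from scipy.stats import norm

def var_normal(mean: float, variance: float, alpha: float) -> float:
    """Normal-approximation VaR: the alpha-quantile of a Gaussian return
    with the given mean and variance, i.e. E(Phi) + z_alpha * sqrt(V(Phi)).
    Whether VaR refers to this quantile or to its negative (a loss figure)
    depends on the sign convention in use."""
    return mean + norm.ppf(alpha) * variance ** 0.5

# Example with hypothetical return statistics.
print(var_normal(mean=5.0, variance=25.0, alpha=0.05))   # approximately -3.22
```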