2016
DOI: 10.1007/978-3-662-49674-9_8
Safety-Constrained Reinforcement Learning for MDPs

Abstract: We consider controller synthesis for stochastic and partially unknown environments in which safety is essential. Specifically, we abstract the problem as a Markov decision process in which the expected performance is measured using a cost function that is unknown prior to run-time exploration of the state space. Standard learning approaches synthesize cost-optimal strategies without guaranteeing safety properties. To remedy this, we first compute safe, permissive strategies. Then, exploration is constrained…
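The abstract outlines a two-step approach: a safe, permissive strategy is computed first, and learning then explores only within that strategy. The snippet below is a minimal illustrative sketch of that idea, not the authors' implementation: it uses tabular Q-learning over costs, and both `env` (with `reset()` and `step()` returning next state, cost, and a done flag) and `safe_actions` (a precomputed mapping from states to their permitted actions) are hypothetical placeholders.

```python
import random
from collections import defaultdict

def safety_constrained_q_learning(env, safe_actions, episodes=500,
                                  alpha=0.1, gamma=0.95, epsilon=0.1):
    """Q-learning in which every action (explored or exploited) is drawn
    from the precomputed safe, permissive strategy `safe_actions`."""
    Q = defaultdict(float)  # (state, action) -> estimated expected cost

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            allowed = safe_actions[state]              # permissive strategy for this state
            if random.random() < epsilon:
                action = random.choice(allowed)        # explore, but only among safe actions
            else:                                      # exploit: pick the cheapest safe action
                action = min(allowed, key=lambda a: Q[(state, a)])
            next_state, cost, done = env.step(action)  # environment reveals the cost at run time
            best_next = 0.0 if done else min(Q[(next_state, a)]
                                             for a in safe_actions[next_state])
            # standard temporal-difference update, minimising expected cost
            Q[(state, action)] += alpha * (cost + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

Because both the exploration and the exploitation step draw only from `safe_actions[state]`, every trajectory sampled during learning stays within the permissive strategy.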

Cited by 83 publications (66 citation statements)
References 22 publications
“…Besides suitability, we consider safety of system behavior. Unaltered RL algorithms use trial-and-error style exploration to optimize their behavior, yet this may not suit a particular domain [78,92,136,153]. For example, tailoring the insulin delivery policy of an artificial pancreas to the metabolism of an individual requires trial insulin delivery actions, but these should only be sampled when their outcome is within safe certainty bounds [44].…”
Section: A Classification Of Personalization Settings
mentioning confidence: 99%
“…multi-objective mean-payoff objectives [8], objectives over instantaneous costs [10], and parity objectives [7]. Multi-objective problems for MDPs with an unknown cost function are considered in [33]. Surveys on multi-objective decision making in AI and machine learning can be found in [44] and [47], respectively.…”
Section: Introduction
mentioning confidence: 99%
“…A trajectory-based algorithm which combines policy gradient and actor-critic methods was presented to solve a CVaR-constrained problem (Chow et al. 2017). For robust MDP problems considering a set of general uncertainties (random actions, unknown costs, and safety hazards), an approach was provided to compute safe and optimal strategies iteratively (Junges et al. 2016). Q-learning has also been used to provide risk-sensitive analysis of fMRI signals, which yields a better interpretation of human behavior in a sequential decision task (Shen et al. 2014).…”
Section: Related Work
mentioning confidence: 99%