Robotics: Science and Systems X 2014
DOI: 10.15607/rss.2014.x.039

Probably Approximately Correct MDP Learning and Control With Temporal Logic Constraints

Abstract: We consider synthesis of controllers that maximize the probability of satisfying given temporal logic specifications in unknown, stochastic environments. We model the interaction between the system and its environment as a Markov decision process (MDP) with initially unknown transition probabilities. The solution we develop builds on the so-called model-based probably approximately correct Markov decision process (PAC-MDP) method. The algorithm attains an ε-approximately optimal policy with probabilit…
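The model-based PAC-MDP approach outlined in the abstract can be illustrated with a small R-max-style sketch: state-action pairs are treated optimistically until they have been sampled enough times, and a policy maximizing the probability of reaching accepting states of a product MDP is computed on the resulting model. This is a minimal sketch under assumed interfaces (the class name, the m_known threshold, and the planning routine are illustrative), not the authors' implementation.

```python
from collections import defaultdict


class RMaxReachabilityLearner:
    """R-max-style PAC-MDP sketch for a reachability objective on a product MDP.

    Illustrative assumptions: state and action sets are given explicitly, and a
    state-action pair counts as "known" after m_known samples.
    """

    def __init__(self, states, actions, accepting, m_known=50, n_iters=200):
        self.states = list(states)
        self.actions = list(actions)
        self.accepting = set(accepting)       # states where the LTL-derived objective is met
        self.m_known = m_known                # samples before (s, a) counts as "known"
        self.n_iters = n_iters
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> s' -> count
        self.totals = defaultdict(int)                        # (s, a) -> total samples

    def observe(self, s, a, s_next):
        """Record one sampled transition."""
        self.counts[(s, a)][s_next] += 1
        self.totals[(s, a)] += 1

    def _p_hat(self, s, a):
        """Empirical transition distribution of a sufficiently sampled (s, a)."""
        total = self.totals[(s, a)]
        return {s2: c / total for s2, c in self.counts[(s, a)].items()}

    def _q(self, v, s, a):
        """Optimistic Q-value: unknown pairs are assumed to reach the goal surely."""
        if self.totals[(s, a)] < self.m_known:
            return 1.0
        return sum(p * v[s2] for s2, p in self._p_hat(s, a).items())

    def plan(self):
        """Value iteration for maximum reachability probability on the optimistic model."""
        v = {s: (1.0 if s in self.accepting else 0.0) for s in self.states}
        for _ in range(self.n_iters):
            for s in self.states:
                if s not in self.accepting:
                    v[s] = max(self._q(v, s, a) for a in self.actions)
        policy = {s: max(self.actions, key=lambda a: self._q(v, s, a)) for s in self.states}
        return policy, v
```

The optimistic value of 1.0 for under-sampled pairs is what drives exploration toward them, which is the essence of the R-max construction the paper builds on.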

Cited by 110 publications (107 citation statements)
References 22 publications (25 reference statements)
“…The underlying task is to learn these probabilities and compute a policy that maximizes the probability of reaching the target. We use a PAC-MDP learning algorithm similar to that shown in [Fu and Topcu, 2014]. In order to mitigate the high sampling requirement and learning time mentioned earlier, we apply the proposed reduction technique to reduce the distributions that need to be sampled without sacrificing the PAC guarantees.…”
Section: Reductions in Gridworlds with LTL Objective
confidence: 99%
“…PAC learning. We now run a modified version of the R-max learning algorithm presented in [Fu and Topcu, 2014] on one of the reduced 10 × 10 MDPs. Explicitly, we aim to learn a policy that with probability at least 1 − δ will be ε-optimal in maximizing reachability probability.…”
Section: Learning in Gridworlds with LTL Objectives
confidence: 99%
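As a rough illustration of the kind of gridworld experiment this citation statement describes, the sketch above can be exercised on a toy slip-dynamics gridworld. The environment, episode loop, and parameters below are assumptions for demonstration only, not the reduced 10 × 10 MDPs or the modified R-max variant used in the citing paper.

```python
import random


def slip_step(s, a, n=5, slip=0.1):
    """Toy gridworld dynamics: intended move with probability 1 - slip, stay put otherwise."""
    if random.random() < slip:
        return s
    dx, dy = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}[a]
    x, y = s
    return (min(max(x + dx, 0), n - 1), min(max(y + dy, 0), n - 1))


states = [(x, y) for x in range(5) for y in range(5)]
actions = ["N", "S", "E", "W"]
learner = RMaxReachabilityLearner(states, actions, accepting={(4, 4)}, m_known=20)

for _ in range(500):                       # exploration episodes
    s = (0, 0)
    policy, _ = learner.plan()             # re-plan on the current optimistic model
    for _ in range(30):                    # bounded-length episode
        a = policy[s]
        s_next = slip_step(s, a)
        learner.observe(s, a, s_next)
        s = s_next
        if s in learner.accepting:
            break

policy, values = learner.plan()
print("Estimated max reachability probability from (0, 0):", values[(0, 0)])
```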
“…Safe or constrained (e.g., by temporal logic specifications) exploration has also been investigated in the learning literature. Some recent examples include [13,14]. An overview on safe exploration using reinforcement learning can be found in [15].…”
Section: Introduction
confidence: 99%
“…For instance, [42] focuses on fusing human and machine "perception." Likewise, attempts to blend human and machine "decision making" occur in the machine learning [60], [36], control theory [29], and human robot interaction literature [32]. A special case of shared decision making is shared control: fuse human and robot platform commands.…”
Section: Introduction
confidence: 99%