Human-in-the-Loop Synthesis for Partially Observable Markov Decision Processes

Carr, Steven; Jansen, Nils; Wimmer, Ralf; Fu, Jie; Topcu, Ufuk

doi:10.23919/acc.2018.8431911

Cited by 10 publications

(4 citation statements)

References 37 publications


“…Junges et al (2018) construct an FSC using parameter synthesis for Markov chains, which is known to be ETR-complete (Junges, Katoen, et al, 2021), whereas NP ⊆ ETR ⊆ PSPACE. Carr et al (2018) render common POMDP scenarios as arcade games to capture human preferences that are formally cast into FSCs and subsequently verified. Ahmadi et al (2020) use control barrier functions to compute safe reachable sets in the belief space of POMDPs.…”

Section: Related Work (mentioning)

confidence: 99%

Task-Aware Verifiable RNN-Based Policies for Partially Observable Markov Decision Processes

Carr; Jansen; Topcu (2021). JAIR

Self Cite


Partially observable Markov decision processes (POMDPs) are models for sequential decision-making under uncertainty and incomplete information. Machine learning methods typically train recurrent neural networks (RNNs) as effective representations of POMDP policies that can efficiently process sequential data. However, it is hard to verify whether the POMDP driven by such RNN-based policies satisfies safety constraints, for instance, given by temporal logic specifications. We propose a novel method that combines techniques from machine learning with the field of formal methods: training an RNN-based policy and then automatically extracting a so-called finite-state controller (FSC) from the RNN. Such FSCs offer a convenient way to verify temporal logic constraints. Implemented on a POMDP, they induce a Markov chain, and probabilistic verification methods can efficiently check whether this induced Markov chain satisfies a temporal logic specification. Using such methods, if the Markov chain does not satisfy the specification, a byproduct of verification is diagnostic information about the states in the POMDP that are critical for the specification. The method exploits this diagnostic information to either adjust the complexity of the extracted FSC or improve the policy by performing focused retraining of the RNN. The method synthesizes policies that satisfy temporal logic specifications for POMDPs with up to millions of states, three orders of magnitude larger than those handled by comparable approaches.
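
The abstract's central step, that an FSC implemented on a POMDP induces a Markov chain which can then be checked, can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the four-state toy model, the names `induced_markov_chain` and `reach_probability`, the restriction to deterministic observations and a deterministic FSC, and the use of plain value iteration in place of a probabilistic model checker.

```python
from itertools import product


def induced_markov_chain(states, memory, trans, obs, action_of, update_of):
    """Build the Markov chain over (state, memory node) pairs.

    Assumptions (illustrative only): observations are a deterministic
    function of the state, and the FSC is deterministic.
      trans[s][a]        -> {next_state: probability}
      obs[s]             -> observation emitted in state s
      action_of[(n, z)]  -> action chosen in memory node n on observation z
      update_of[(n, z)]  -> next memory node
    """
    chain = {}
    for s, n in product(states, memory):
        z = obs[s]
        a = action_of[(n, z)]
        n2 = update_of[(n, z)]
        chain[(s, n)] = {(s2, n2): p for s2, p in trans[s][a].items()}
    return chain


def reach_probability(chain, targets, iterations=100):
    """Approximate Pr(eventually reach a target state) by value iteration."""
    value = {x: (1.0 if x[0] in targets else 0.0) for x in chain}
    for _ in range(iterations):
        value = {
            x: 1.0 if x[0] in targets
            else sum(p * value[y] for y, p in succ.items())
            for x, succ in chain.items()
        }
    return value


# Hypothetical toy POMDP: s0 and s1 look identical ("dark") to the agent.
states = ["s0", "s1", "goal", "trap"]
memory = ["n0", "n1"]
trans = {
    "s0": {"a": {"s1": 1.0}, "b": {"trap": 1.0}},
    "s1": {"a": {"trap": 1.0}, "b": {"goal": 0.8, "trap": 0.2}},
    "goal": {"a": {"goal": 1.0}, "b": {"goal": 1.0}},
    "trap": {"a": {"trap": 1.0}, "b": {"trap": 1.0}},
}
obs = {"s0": "dark", "s1": "dark", "goal": "goal", "trap": "trap"}

# Two-node FSC: play "a" first, then "b", alternating while in the dark.
action_of = {("n0", "dark"): "a", ("n1", "dark"): "b"}
update_of = {("n0", "dark"): "n1", ("n1", "dark"): "n0"}
for n in memory:
    for z in ("goal", "trap"):
        action_of[(n, z)] = "a"   # absorbing states: the action is irrelevant
        update_of[(n, z)] = n

chain = induced_markov_chain(states, memory, trans, obs, action_of, update_of)
probs = reach_probability(chain, targets={"goal"})
print(probs[("s0", "n0")])
```

In this toy model a memoryless policy cannot distinguish s0 from s1, so the second memory node is what lets the controller reach the goal at all. A specification such as "reach the goal with probability at least 0.9" would fail on this chain, which is the situation in which the abstract describes using diagnostic information to refine the FSC or retrain the RNN.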


Section: Related Work (mentioning)

confidence: 99%

Task-Aware Verifiable RNN-Based Policies for Partially Observable Markov Decision Processes

Carr; Jansen; Topcu (2021). JAIR

Self Cite


“…As they depend on probability distributions on partially observed states, optimal policies for mixed-observability MDPs and POMDPs are generally difficult to compute exactly [40], [41]. In this simulation, we used a randomized approximation of an optimal policy based on combining optimal actions for MDPs where beliefs are known, with weights corresponding to the probability distribution of the beliefs [42]. The light blue graph in Figure 5 describes average rewards (17), in analogy to the left side of Figure 3.…”

Section: Optimal Deception With Imperfect Knowledge (mentioning)

confidence: 99%

Deception in Optimal Control

Ornik; Topcu (2018). 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton)

Self Cite


In this paper, we consider an adversarial scenario where one agent seeks to achieve an objective and its adversary seeks to learn the agent's intentions and prevent the agent from achieving its objective. The agent has an incentive to try to deceive the adversary about its intentions, while at the same time working to achieve its objective. The primary contribution of this paper is to introduce a mathematically rigorous framework for the notion of deception within the context of optimal control. The central notion introduced in the paper is that of a belief-induced reward: a reward dependent not only on the agent's state and action, but also on the adversary's beliefs. Design of an optimal deceptive strategy then becomes a question of optimal control design on the product of the agent's state space and the adversary's belief space. The proposed framework allows for deception to be defined in an arbitrary control system endowed with a reward function, as well as with additional specifications limiting the agent's control policy. In addition to defining deception, we discuss design of optimally deceptive strategies under uncertainties in the agent's knowledge about the adversary's learning process. In the latter part of the paper, we focus on a setting where the agent's behavior is governed by a Markov decision process, and show that the design of optimally deceptive strategies under lack of knowledge about the adversary naturally reduces to previously discussed problems in control design on partially observable or uncertain Markov decision processes. Finally, we present two examples of deceptive strategies: a "cops and robbers" scenario and an example where an agent may use camouflage while moving. We show that optimally deceptive strategies in such examples follow the intuitive idea of how to deceive an adversary in the above settings.
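
The citation statement quoted for this entry describes a randomized policy that combines MDP-optimal actions with weights given by the belief distribution. One plausible reading of that description is sketched below; it is an assumption, not the construction used in the cited reference [42]. The function names, the example belief, and the example MDP policy are all hypothetical.

```python
import random
from collections import defaultdict


def action_distribution(belief, mdp_policy):
    """Probability of each action under a belief-weighted mixture.

    belief[s]      -> probability that the true state is s
    mdp_policy[s]  -> action that would be optimal if s were known
    An action's probability is the total belief mass of the states
    in which that action is MDP-optimal.
    """
    dist = defaultdict(float)
    for state, weight in belief.items():
        dist[mdp_policy[state]] += weight
    return dict(dist)


def sample_action(belief, mdp_policy, rng=random):
    """Draw one action from the belief-weighted mixture."""
    dist = action_distribution(belief, mdp_policy)
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Hypothetical example: three candidate states, two actions.
belief = {"s0": 0.5, "s1": 0.3, "s2": 0.2}
mdp_policy = {"s0": "left", "s1": "right", "s2": "left"}

dist = action_distribution(belief, mdp_policy)
print(dist)                                   # "left" carries the mass of s0 and s2
print(sample_action(belief, mdp_policy, random.Random(0)))
```

Such a mixture needs only the fully observable MDP solution and the current belief, which avoids solving the POMDP exactly, at the cost of giving only an approximation of an optimal policy.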


“…In contrast to our work, neither [20] nor [21] specializes on planning problems, and there is neither an implementation available nor any analysis how well these methods scale to systems of relevant size. Instead of automated abstraction, an interactive human-in-the-loop approach for strategy synthesis in POMDPs is described in [22], but such an approach, in contrast to the method described here, may not be fully automated. The strategies obtained by the method in this paper are finite-memory strategies.…”

Section: Introduction (mentioning)

confidence: 99%

Strategy Synthesis for POMDPs in Robot Planning via Game-Based Abstractions

Winterer; Junges; Wimmer; et al. (2021). IEEE Trans. Automat. Contr.

Self Cite


We study synthesis problems with constraints in partially observable Markov decision processes (POMDPs), where the objective is to compute a strategy for an agent that is guaranteed to satisfy certain safety and performance specifications. Verification and strategy synthesis for POMDPs are, however, computationally intractable in general. We alleviate this difficulty by focusing on planning applications and exploiting typical structural properties of such scenarios; for instance, we assume that the agent has the ability to observe its own position inside an environment. We propose an abstraction refinement framework which turns such a POMDP model into a (fully observable) probabilistic two-player game (PG). For the obtained PGs, efficient verification and synthesis tools make it possible to determine strategies with optimal safety and performance measures, which approximate optimal schedulers on the POMDP. If the approximation is too coarse to satisfy the given specifications, a refinement scheme improves the computed strategies. As a running example, we use planning problems where an agent moves inside an environment with randomly moving obstacles and restricted observability. We demonstrate that the proposed method advances the state of the art by solving problems several orders of magnitude larger than those that can be handled by existing POMDP solvers. Furthermore, this method gives guarantees on safety constraints, which is not supported by the majority of the existing solvers.
