L∗-based learning of Markov decision processes (extended version)

Tappler, Martin; Aichernig, Bernhard K.; Bacci, Giovanni; Eichlseder, Maria; Larsen, Kim Guldstrand

doi:10.1007/s00165-021-00536-5

Cited by 14 publications

(18 citation statements)

References 61 publications

“…The following result follows from the convergence of L∗ for label-deterministic MDPs, shown in (Tappler et al 2021). In that work, the authors show that under uniformly randomized testing strategies, the sampling-based L∗ algorithm converges almost surely in the limit to the MDP under learning.…”

Section: Correctness Convergence and Compactness

mentioning

confidence: 92%

“…The L∗ algorithm for learning regular languages (Angluin 1987) is the quintessential example of active inference, and assumes the existence of a minimally adequate teacher capable of answering membership and equivalence queries. This method has been broadly adopted and generalized to learn interface automata (Aarts and Vaandrager 2010), Mealy machines (Niese 2003), automaton representations of recurrent neural networks Yahav 2018, 2019), and MDPs (Tappler et al 2021).…”

Section: Related Work

mentioning

confidence: 99%

“…Lemma 1 (Tappler et al 2021). Two MDPs M and M′ with n and at most n states, respectively, are equivalent iff…”

Section: Correctness Convergence and Compactness

mentioning

confidence: 99%

“…Resetting the learning rate parameter to very low values at the start of each query and gradually increasing it after a sufficiently long initial period of exploration results in a period of essentially random exploration and thus simulates a uniformly random testing strategy. This allows us to apply the convergence arguments from (Tappler et al 2021) to our setting to assert that our learning procedure converges almost surely in the limit to the PRM (or an equivalent PRM) encoding the reward function of the TMDP under learning. In particular, Lemma 1 provides a lower bound on the reinforcement learning episode lengths necessary to achieve convergence in the limit.…”

Section: Correctness Convergence and Compactness

mentioning

confidence: 99%

“…The reward-determinism property (cf. Definition 3) is inspired by the label-determinism property used for learning MDPs in (Tappler et al 2021). Algorithm 1 always learns a reward-deterministic PRM, even if the true reward is specified as a non-reward-deterministic PRM.…”

mentioning

confidence: 99%

Inferring Probabilistic Reward Machines from Non-Markovian Reward Signals for Reinforcement Learning

Dohmen, Topper, Atia, et al. 2022. ICAPS.

The success of reinforcement learning in typical settings is predicated on Markovian assumptions on the reward signal by which an agent learns optimal policies. In recent years, the use of reward machines has relaxed this assumption by enabling a structured representation of non-Markovian rewards. In particular, such representations can be used to augment the state space of the underlying decision process, thereby facilitating non-Markovian reinforcement learning. However, these reward machines cannot capture the semantics of stochastic reward signals. In this paper, we make progress on this front by introducing probabilistic reward machines (PRMs) as a representation of non-Markovian stochastic rewards. We present an algorithm to learn PRMs from the underlying decision process and prove results around its correctness and convergence.

Efficient Black-Box Checking via Model Checking with Strengthened Specifications

Shijubo, Waga, Suenaga 2021. Runtime Verification.

We introduce a novel methodology for testing stochastic black-box systems, frequently encountered in embedded systems. Our approach enhances the established black-box checking (BBC) technique to address stochastic behavior. Traditional BBC primarily involves iteratively identifying an input that breaches the system's specifications by executing the following three phases: the learning phase to construct an automaton approximating the black box's behavior, the synthesis phase to identify a candidate counterexample from the learned automaton, and the validation phase to validate the obtained candidate counterexample and the learned automaton against the original black-box system. Our method, ProbBBC, refines the conventional BBC approach by (1) employing an active Markov Decision Process (MDP) learning method during the learning phase, (2) incorporating probabilistic model checking in the synthesis phase, and (3) applying statistical hypothesis testing in the validation phase. ProbBBC uniquely integrates these techniques rather than merely substituting each method in the traditional BBC; for instance, the statistical hypothesis testing and the MDP learning procedure exchange information regarding the black-box system's observation with one another. The experiment results suggest that ProbBBC outperforms an existing method, especially for systems with limited observation. CCS Concepts: • Theory of computation → Verification by model checking; • Software and its engineering → Formal software verification; • Computer systems organization → Embedded systems.