2020
DOI: 10.1609/icaps.v30i1.6756

Joint Inference of Reward Machines and Policies for Reinforcement Learning

Abstract: Incorporating high-level knowledge is an effective way to expedite reinforcement learning (RL), especially for complex tasks with sparse rewards. We investigate an RL problem where the high-level knowledge is in the form of reward machines, a type of Mealy machine that encodes non-Markovian reward functions. We focus on a setting in which this knowledge is not available a priori to the learning agent. We develop an iterative algorithm that performs joint inference of reward machines and policies for RL (more s…
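As a rough illustration of the structure the abstract refers to, the sketch below models a reward machine as a small Mealy machine: states track progress through a non-Markovian task, and each transition consumes a high-level label and emits a reward. The class name, labels, and two-step example task are hypothetical, not taken from the paper.

```python
# Minimal sketch of a reward machine as a Mealy machine (hypothetical names,
# not the paper's implementation). States track non-Markovian task progress;
# each transition consumes a high-level label and emits a scalar reward.

class RewardMachine:
    def __init__(self, initial_state, transitions):
        # transitions: dict mapping (state, label) -> (next_state, reward)
        self.initial_state = initial_state
        self.transitions = transitions

    def step(self, state, label):
        # Unlisted (state, label) pairs self-loop with zero reward.
        return self.transitions.get((state, label), (state, 0.0))


# Example: "reach a, then reach b" yields reward 1 only after both events occur
# in order -- a reward function that is non-Markovian in the raw observations.
rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "a"): ("u1", 0.0),
        ("u1", "b"): ("u2", 1.0),
    },
)

state = rm.initial_state
for label in ["b", "a", "b"]:
    state, reward = rm.step(state, label)
    print(label, state, reward)   # only the final "b" produces reward 1.0
```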

Cited by 30 publications (20 citation statements)
References 25 publications
“…Task 3 is the Minecraft-like gridworld introduced in (Andreas, Klein, and Levine 2017), where the objective is to build a spear by gathering wood, string, and stone in any order, then reaching a workbench (See Figure 3 in (Andreas, Klein, and Levine 2017)). We evaluate the time and sample efficiency of our non-Markovian planning method compared to the JIRP and AFRAI methods presented in (Xu et al 2020) and (Xu et al 2021), respectively. A sample in our method includes the queries to the model during learning and the environment samples observed during execution.…”
Section: Methods
Mentioning, confidence: 99%
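To make the "gather three items in any order, then reach a workbench" structure from the statement above concrete, here is a hedged sketch (illustrative names and rewards, not the cited implementation) that enumerates the progress states such a task induces: one state per subset of collected items, plus a terminal state.

```python
# Hypothetical encoding of a "collect wood, string, and stone in any order,
# then reach the workbench" task as reward-machine transitions. States are
# frozensets of items gathered so far; rewards and labels are illustrative.

from itertools import chain, combinations

ITEMS = ("wood", "string", "stone")

def powerset(items):
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

transitions = {}
for collected in map(frozenset, powerset(ITEMS)):
    for item in ITEMS:
        if item not in collected:
            transitions[(collected, item)] = (collected | {item}, 0.0)
    if collected == frozenset(ITEMS):
        transitions[(collected, "workbench")] = ("done", 1.0)

print(len(transitions))  # 12 gathering transitions plus 1 completing "workbench" transition
```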
“…Active inference allows the system under learning to be queried, whereas its passive counterpart leverages an existing static repository of samples. The work in (Xu et al. 2020) uses passive grammatical inference by storing traces of behavior that arise in the standard Q-learning methodology. This repository can then be used to synthesize a reward machine using techniques such as satisfiability solving.…”
Section: Related Work
Mentioning, confidence: 99%
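A minimal sketch of the passive trace-collection idea described in the statement above, assuming a hypothetical environment interface (`reset`, `actions`, `step`) and a labeling function that maps observations to high-level propositions; the SAT-based reward-machine synthesis step itself is omitted.

```python
# Rough sketch: run ordinary Q-learning, record each episode's label/reward
# trace, and hand the accumulated traces to a separate reward-machine
# synthesizer (e.g. a satisfiability-based one, not shown here).
# The environment and labeler interfaces are hypothetical stubs.

import random
from collections import defaultdict

def q_learning_with_traces(env, labeler, episodes=100, alpha=0.1, gamma=0.9, eps=0.1):
    q = defaultdict(float)           # (observation, action) -> estimated value
    traces = []                      # repository of (label, reward) sequences
    for _ in range(episodes):
        obs, trace, done = env.reset(), [], False
        while not done:
            actions = env.actions(obs)
            if random.random() < eps:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(obs, a)])
            next_obs, reward, done = env.step(obs, action)
            best_next = max((q[(next_obs, a)] for a in env.actions(next_obs)), default=0.0)
            q[(obs, action)] += alpha * (reward + gamma * best_next - q[(obs, action)])
            trace.append((labeler(next_obs), reward))   # store high-level labels, not raw states
            obs = next_obs
        traces.append(trace)
    return q, traces    # traces can later seed reward-machine synthesis
```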
“…Sarathy et al. (2021) incorporates RL with symbolic planning models to learn new operators, similar to our subtasks, to aid in the completion of planning objectives. Meanwhile, Toro Icarte et al. (2018), Xu et al. (2020), and Toro Icarte et al. (2022) use reward machines, finite-state machines encoding temporally extended tasks in terms of atomic propositions, to break tasks into stages for which separate policies can be learned. Neary et al. (2021b) extends the use of reward machines to the multi-agent RL setting, decomposing team tasks into subtasks for individual learners.…”
Section: Related Work
Mentioning, confidence: 99%
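The "separate policies per stage" idea in the statement above can be sketched as one Q-table per reward-machine state (a QRM-style decomposition); the class and method names below are illustrative, not taken from the cited works.

```python
# Illustrative sketch (not the cited implementation): keep one tabular
# Q-function per reward-machine state, so each task stage has its own policy.

from collections import defaultdict

class StagewisePolicies:
    def __init__(self, rm_states):
        self.q = {u: defaultdict(float) for u in rm_states}  # one Q-table per stage

    def best_action(self, rm_state, obs, actions):
        # Greedy action under the policy attached to the current reward-machine state.
        return max(actions, key=lambda a: self.q[rm_state][(obs, a)])

    def update(self, rm_state, obs, action, target, alpha=0.1):
        table = self.q[rm_state]
        table[(obs, action)] += alpha * (target - table[(obs, action)])


policies = StagewisePolicies(rm_states=["u0", "u1", "u2"])
policies.update("u0", obs=(0, 0), action="north", target=1.0)
print(policies.best_action("u0", obs=(0, 0), actions=["north", "south"]))  # -> "north"
```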
“…While much progress has been made on learning and leveraging reward machines for decision processes with non-Markovian rewards (Xu et al. 2020, 2021; Abadi and Brafman 2020; Gaon and Brafman 2020; Neider et al. 2021; Rens et al. 2021), the more general setting where rewards exhibit both non-Markovian and stochastic dynamics has not been addressed. In this paper, we make progress on this front by introducing probabilistic reward machines (PRMs).…”
Section: Introduction
Mentioning, confidence: 99%
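As a hedged sketch of what a probabilistic reward machine could look like under the description above, each transition emits a reward drawn from a distribution rather than a fixed value; the interface and example numbers are assumptions for illustration only.

```python
# Hedged sketch of a probabilistic reward machine: like a reward machine,
# but a transition's output is a distribution over rewards. Names and
# probabilities are illustrative, not the cited formulation.

import random

class ProbabilisticRewardMachine:
    def __init__(self, initial_state, transitions):
        # transitions: (state, label) -> (next_state, [(reward, probability), ...])
        self.initial_state = initial_state
        self.transitions = transitions

    def step(self, state, label):
        next_state, reward_dist = self.transitions.get((state, label), (state, [(0.0, 1.0)]))
        rewards, probs = zip(*reward_dist)
        return next_state, random.choices(rewards, weights=probs, k=1)[0]


prm = ProbabilisticRewardMachine(
    "u0",
    {("u0", "goal"): ("u1", [(1.0, 0.8), (0.0, 0.2)])},  # stochastic reward on task completion
)
print(prm.step("u0", "goal"))  # ("u1", 1.0) with probability 0.8, else ("u1", 0.0)
```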