“…Since then, RMs have been used for solving problems in planning (Illanes et al., 2019, 2020), robotics (DeFazio & Zhang, 2021; Camacho et al., 2020, 2021), multi-agent systems (Neary et al., 2021), lifelong RL (Zheng et al., 2021), and partial observability (Toro Icarte et al., 2019a). […] also considered both Mealy and Moore versions of RMs, though theirs only output numbers (like our simple RMs) instead of reward functions. Finally, there has been prominent work on how to learn RMs from experience (e.g., Toro Icarte et al., 2019a, 2019b; Xu et al., 2020a, 2020b; Furelos-Blanco et al., 2020a; Rens & Raskin, 2020; Hasanbeig et al., 2021; Velasquez et al., 2021). Since our previous work, we have gained practical experience and new theoretical insights about reward machines, which are reflected in this paper. In particular, we provide a cleaner definition of reward machines and QRM.…”