Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence 2019
DOI: 10.24963/ijcai.2019/840

LTL and Beyond: Formal Languages for Reward Function Specification in Reinforcement Learning

Abstract: In Reinforcement Learning (RL), an agent is guided by the rewards it receives from the reward function. Unfortunately, it may take many interactions with the environment to learn from sparse rewards, and it can be challenging to specify reward functions that reflect complex reward-worthy behavior. We propose using reward machines (RMs), which are automata-based representations that expose reward function structure, as a normal form representation for reward functions. We show how specifications of reward in va…

Cited by 123 publications (110 citation statements). References: 0 publications.

Citation statements (ordered by relevance):
“…Solving MDPs with non-Markovian rewards [Bacchus et al., 1996; Thiébaux et al., 2006; Brafman et al., 2018] with PLTLf/PLDLf rewards is EXPTIME-complete in the domain and EXPTIME in the PLTLf/PLDLf rewards, while the latter is 2EXPTIME-complete for LTLf/LDLf rewards [Brafman et al., 2018]. Reinforcement Learning where rewards are based on traces [De Giacomo et al., 2019; Camacho et al., 2019] with PLTLf/PLDLf rewards also gains the exponential improvement. Planning in non-Markovian domains [Brafman and De Giacomo, 2019a], with both the non-Markovian domain and the goal expressed in PLTLf/PLDLf, is EXPTIME-complete in the domain and in the goal, vs. 2EXPTIME-complete in the domain and in the goal when these are expressed in LTLf/LDLf.…”
Section: Reverse Languages and AFA (citation type: mentioning)
confidence: 99%
“…This exponential improvement affects the computational complexity of problems involving temporal logics on finite traces in several contexts, including planning in nondeterministic domains (FOND) [Camacho et al., 2017; De Giacomo and Rubin, 2018], reactive synthesis [De Giacomo and Vardi, 2015; Camacho et al., 2018], MDPs with non-Markovian rewards [Bacchus et al., 1996; Brafman et al., 2018], reinforcement learning [De Giacomo et al., 2019; Camacho et al., 2019], and non-Markovian planning and decision problems [Brafman and De Giacomo, 2019a; Brafman and De Giacomo, 2019b].…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…where γ is the MDP's discount factor and Φ : S → R is a real-valued function. The automata structure can be exploited by defining F : (U \ {u_A, u_R}) × U → R in terms of the automaton states instead of the MDP states (Camacho et al., 2019; Furelos-Blanco et al., 2020):…”
Section: Option Modeling Given a Subgoal Automaton (citation type: mentioning)
confidence: 99%
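For context, the shaping function F referenced in the quoted passage follows the usual potential-based form. A minimal sketch, assuming the potential is taken over the reward-machine states U (with the accepting and rejecting states u_A and u_R excluded from the first argument), written in LaTeX:

\[
  F(u, u') \;=\; \gamma\,\Phi(u') - \Phi(u),
  \qquad u \in U \setminus \{u_A, u_R\},\ u' \in U,
\]

where \Phi here maps automaton states (rather than MDP states) to real values; the exact definition used by the cited works is not shown in the excerpt.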
“…Automaton structures have also been exploited in reward machines to give bonus reward signals. Camacho et al. (2019) convert reward functions expressed in various formal languages (e.g., linear temporal logic) into RMs, and propose a reward shaping method that runs value iteration on the RM states. Similarly, Camacho et al. (2017) use automata as representations of non-Markovian rewards and exploit their structure to guide the search of an MDP planner using reward shaping.…”
Section: Automata In Reinforcement Learning (citation type: mentioning)
confidence: 99%
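The value-iteration-on-RM-states idea described in the statement above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation; the RewardMachine fields and function names below are assumptions made for the example. The idea is to treat the reward machine as a small deterministic graph, run value iteration over its states, and use the resulting values as shaping potentials.

# Illustrative sketch (not the cited implementation): value iteration over the
# states of a reward machine, whose values can serve as shaping potentials.
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class RewardMachine:
    states: Set[str]                                                      # automaton states U
    transitions: Dict[str, Set[str]] = field(default_factory=dict)        # u -> successor states u'
    rewards: Dict[Tuple[str, str], float] = field(default_factory=dict)   # (u, u') -> reward


def rm_value_iteration(rm: RewardMachine, gamma: float = 0.9,
                       tol: float = 1e-8) -> Dict[str, float]:
    """Compute V(u) for every reward-machine state by value iteration.

    Each RM transition is treated as a choosable action; states with no
    outgoing transitions (e.g. terminal states) keep value 0.
    """
    v = {u: 0.0 for u in rm.states}
    while True:
        delta = 0.0
        for u in rm.states:
            successors = rm.transitions.get(u, set())
            if not successors:
                continue
            best = max(rm.rewards.get((u, u2), 0.0) + gamma * v[u2]
                       for u2 in successors)
            delta = max(delta, abs(best - v[u]))
            v[u] = best
        if delta < tol:
            return v

A step that moves the machine from u to u' would then add gamma * V[u'] - V[u] to the environment reward, which is the standard potential-based shaping term and therefore preserves optimal policies.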
“…In order to compute the policy, the PUnS instance is first compiled into a reward machine ([26]) corresponding to a Markov representation for P(ϕ), represented as a deterministic MDP,…”
Section: Planning With Uncertain Specifications (citation type: mentioning)
confidence: 99%
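As a rough illustration of the compilation step mentioned above, a product construction pairs each environment state with a reward-machine state so that the reward becomes Markovian on the joint state. The sketch below is an assumption-laden Python outline, not the cited system; env_step, rm_delta, rm_reward, and labeler are hypothetical callables standing in for the environment dynamics, the machine's transition function, its reward function, and the labelling of steps with propositions.

# Illustrative sketch (assumed interfaces): the product of an environment MDP
# with a reward machine is itself a Markov decision process over (s, u) pairs.
def make_product_step(env_step, rm_delta, rm_reward, labeler):
    """Return a step function for the product MDP."""
    def step(state, action):
        s, u = state
        s_next = env_step(s, action)                       # environment transition
        u_next = rm_delta(u, labeler(s, action, s_next))   # RM reads the step's labels
        return (s_next, u_next), rm_reward(u, u_next)      # reward is Markovian on (s, u)
    return step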