2021
DOI: 10.1609/aaai.v35i10.17096

Advice-Guided Reinforcement Learning in a non-Markovian Environment

Abstract: We study a class of reinforcement learning tasks in which the agent receives its reward for complex, temporally-extended behaviors sparsely. For such tasks, the problem is how to augment the state-space so as to make the reward function Markovian in an efficient way. While some existing solutions assume that the reward function is explicitly provided to the learning algorithm (e.g., in the form of a reward machine), others learn the reward function from the interactions with the environment, assuming no pr…
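The abstract describes augmenting the state space so that a non-Markovian reward becomes Markovian, typically by running the environment in product with a reward machine. Below is a minimal sketch of that product construction; the toy reward machine, the `env_step` interface, and the `labeling_fn` are hypothetical illustrations, not the paper's implementation.

```python
# Minimal sketch (assumed, not from the paper): augment the environment state
# with a reward-machine state so the reward depends only on the augmented state.

class RewardMachine:
    """Toy reward machine: reward 1.0 only after seeing event "a" and then "b"."""

    def __init__(self):
        # transitions[(u, event)] -> (next_u, reward); unlisted events self-loop
        self.transitions = {
            (0, "a"): (1, 0.0),  # first milestone reached, no reward yet
            (1, "b"): (2, 1.0),  # temporally-extended behavior completed
        }
        self.initial_state = 0

    def step(self, u, event):
        return self.transitions.get((u, event), (u, 0.0))


def augmented_step(env_step, rm, s, u, action, labeling_fn):
    """One step in the product of the environment and the reward machine.

    The pair (s, u) is the augmented state; the returned reward is a function
    of that pair alone, so standard Markovian RL algorithms can be applied.
    `env_step` and `labeling_fn` are assumed interfaces for illustration.
    """
    s_next, _, done, info = env_step(s, action)   # hypothetical env interface
    event = labeling_fn(s, action, s_next)        # maps a transition to an event
    u_next, reward = rm.step(u, event)
    return (s_next, u_next), reward, done, info
```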

Cited by 13 publications (8 citation statements)
References 18 publications
“…While most literature assumes a reward machine is given (Icarte et al 2018; Hahn et al 2019), the problem of learning the machine from observations has only been considered recently. The work in (Xu et al 2020; Neider et al 2021) explores a solving approach based on satisfiability. In (Icarte et al 2019), the problem of learning a reward machine is viewed through the lens of discrete optimization.…”
Section: Related Work
confidence: 99%
“…However, many problem domains require taking into account this history. Examples include learning in settings with sparse rewards (Neider et al 2021), rewards defined as regular expressions and formal logics (Camacho et al 2019), and decision-making under partial observability (Icarte et al 2019). In such settings, the non-Markovian reward signal may not be known, but can be learned from traces of behavior.…”
Section: Introduction
confidence: 99%
“…In most real-world problems, the rewards do not depend on the immediate state and the chosen action but rather on the agent's visited states and performed actions. In such environments, the Markovian assumption (MA) does not hold, and the reward function has a temporal nature [31], in that the agent receives its rewards sparsely for complex, temporally-extended behaviors. For example, a robot should be rewarded for delivering coffee only if a user previously requested it.…”
Section: Introduction
confidence: 99%
“…We use a Transformer-based architecture [36] to capture long-horizon dependencies for histories of state-action pairs, which include all reward-relevant historical information. Moreover, in non-Markovian environments, the agent receives its reward sparsely for complex actions over a long period of time [31], which is not conducive to training the Transformer-based policy representation model. We construct a sample categorical distribution to sample higher cumulative reward trajectories with a higher probability.…”
Section: Introduction
confidence: 99%
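The last statement describes sampling stored trajectories with probability increasing in their cumulative reward. A hedged sketch of one way to realize such a categorical distribution is shown below; the softmax weighting, the temperature parameter, and the replay-buffer layout are assumptions for illustration, not details taken from the cited work.

```python
import numpy as np

def sample_trajectories(buffer, returns, n_samples, temperature=1.0, rng=None):
    """Sample trajectories with probability increasing in cumulative reward.

    buffer: list of trajectories; returns: their cumulative rewards.
    A softmax over returns (a modeling assumption here) favors higher-return,
    sparsely rewarded behaviors when building training batches.
    """
    rng = rng or np.random.default_rng()
    returns = np.asarray(returns, dtype=float)
    logits = (returns - returns.max()) / temperature  # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum()
    idx = rng.choice(len(buffer), size=n_samples, p=probs)
    return [buffer[i] for i in idx]
```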