2022
DOI: 10.1609/aaai.v36i6.20602

Distillation of RL Policies with Formal Guarantees via Variational Abstraction of Markov Decision Processes

Abstract: We consider the challenge of policy simplification and verification in the context of policies learned through reinforcement learning (RL) in continuous environments. In well-behaved settings, RL algorithms have convergence guarantees in the limit. While these guarantees are valuable, they are insufficient for safety-critical applications. Furthermore, they are lost when applying advanced techniques such as deep-RL. To recover guarantees when applying advanced RL algorithms to more complex environments with (…

Cited by 4 publications (8 citation statements). References 33 publications (44 reference statements).
“…The optimization process relies on a temperature 𝜆 ∈ [0, 1) that controls the continuity of the learned latent space, with the zero-temperature limit corresponding to a discrete latent state space. This procedure guarantees M_𝜃 to be probably approximately bisimilarly close [12, 16, 27] to M as 𝜆 → 0: in a nutshell, bisimulation metrics imply the closeness of the two models in terms of probability measures and expected return [13, 14].…”
Section: Latent Space Modeling
Mentioning confidence: 94%
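Below is a minimal, illustrative sketch (not the paper's implementation) of how such a temperature can relax a discrete latent state: a Gumbel-softmax-style sample over latent states hardens to a one-hot, i.e. discrete, state as the temperature approaches zero. The function name, shapes, and logits are assumptions for illustration only.

```python
# Illustrative sketch of a temperature-relaxed discrete latent state.
# As the temperature approaches 0, the relaxed sample approaches a one-hot vector.
import numpy as np

def relaxed_latent_state(logits: np.ndarray, temperature: float, rng=np.random) -> np.ndarray:
    """Sample a relaxed one-hot latent state from unnormalized log-probabilities."""
    # Gumbel noise makes the argmax of (logits + noise) a categorical sample.
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    scores = (logits + gumbel) / max(temperature, 1e-8)
    # Softmax is a smooth stand-in for argmax; it hardens to one-hot as temperature -> 0.
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

logits = np.array([1.0, 0.2, -0.5])        # 3 hypothetical latent states
print(relaxed_latent_state(logits, 0.9))   # high temperature: mass spread over states
print(relaxed_latent_state(logits, 0.01))  # near-zero temperature: approximately one-hot
```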
“…Wasserstein Auto-encoded MDPs (WAE-MDPs) [11] are latent space models trained via optimal transport from the trajectory distribution induced by executing the RL agent's policy in the real environment M to the distribution reconstructed from the latent model M_𝜃. The optimization process relies on a temperature 𝜆 ∈ [0, 1) that controls the continuity of the learned latent space, with the zero-temperature limit corresponding to a discrete latent state space.…”
Section: Latent Space Modeling
Mentioning confidence: 99%
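As a rough illustration of the optimal-transport flavour of this objective (not the actual WAE-MDP loss from [11]), the sketch below compares a scalar per-trajectory statistic, such as the return, under rollouts in the real environment with the same statistic reconstructed from a latent model, using an empirical 1-D Wasserstein-1 distance. The rollout callables are hypothetical placeholders.

```python
# Crude optimal-transport-style comparison between real and latent-model trajectories.
import numpy as np

def wasserstein_1d(samples_p: np.ndarray, samples_q: np.ndarray) -> float:
    """Empirical W1 distance between two equal-size 1-D samples (sort-and-compare)."""
    return float(np.mean(np.abs(np.sort(samples_p) - np.sort(samples_q))))

def transport_loss(rollout_real, rollout_latent, policy, n_trajectories=64) -> float:
    """W1 between per-trajectory returns in the real environment and the latent model."""
    real = np.array([rollout_real(policy) for _ in range(n_trajectories)])
    latent = np.array([rollout_latent(policy) for _ in range(n_trajectories)])
    return wasserstein_1d(real, latent)
```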
“…Policy distillation involves transforming a policy from one representation to another while keeping its essential input-output behavior as similar as possible. This can mean compressing a neural network or Markov Decision Process (MDP) into a smaller network or MDP [20], or converting it into a different form entirely, such as a saliency map [21] or a tree [22], [13]. For policy distillation, our approach is based on VIPER [13] because it is flexible in terms of the method used to learn the expert policy.…”
Section: Background and Related Work
Mentioning confidence: 99%
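The sketch below is a simplified, DAgger-style distillation loop in the spirit of VIPER [13], not the authors' implementation: it rolls out the student tree, relabels visited states with the expert's actions, and refits a decision tree on the aggregated data. VIPER additionally resamples states weighted by the expert's Q-values, which this sketch omits. A Gymnasium-style environment interface and an expert_action callable are assumed.

```python
# Simplified DAgger-style distillation of an expert policy into a decision tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def distill_to_tree(env, expert_action, n_iterations=10, episodes_per_iter=20, max_depth=6):
    states, actions = [], []
    tree = None
    for _ in range(n_iterations):
        for _ in range(episodes_per_iter):
            obs, _ = env.reset()
            done = False
            while not done:
                # Relabel every visited state with the expert's action.
                states.append(obs)
                actions.append(expert_action(obs))
                # Act with the student tree once it exists, otherwise with the expert.
                act = expert_action(obs) if tree is None else int(tree.predict([obs])[0])
                obs, _, terminated, truncated, _ = env.step(act)
                done = terminated or truncated
        # Refit the student tree on all data aggregated so far.
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(np.array(states), np.array(actions))
    return tree
```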