2012
DOI: 10.1007/978-3-642-33492-4_6

Policy Search in a Space of Simple Closed-form Formulas: Towards Interpretability of Reinforcement Learning

Abstract: In this paper, we address the problem of computing interpretable solutions to reinforcement learning (RL) problems. To this end, we propose a search algorithm over a space of simple closed-form formulas that are used to rank actions. We formalize the search for a high-performance policy as a multi-armed bandit problem where each arm corresponds to a candidate policy canonically represented by its shortest formula-based representation. Experiments, conducted on standard benchmarks, show that this approach…
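To make the abstract's idea concrete, here is a minimal sketch (in Python, not the authors' code) of a formula-based policy: a closed-form scoring formula ranks the available actions, the policy plays the top-ranked action, and its quality is estimated by Monte Carlo rollouts. The toy one-step problem, the two candidate formulas, and all names below are illustrative assumptions.

import random

# Minimal sketch (not the authors' implementation): a policy is represented by a
# simple closed-form formula that scores actions; the policy plays the action
# with the highest score, and its return is estimated by Monte Carlo rollouts.

def make_formula_policy(formula):
    """Turn a closed-form scoring formula f(state, action) into a greedy policy."""
    return lambda state, actions: max(actions, key=lambda a: formula(state, a))

def simulate_episode(policy):
    """One rollout of a toy one-step problem: state x ~ U[0,1], reward depends on x."""
    x = random.random()
    action = policy(x, [0, 1])
    return x if action == 0 else 1.0 - x

def monte_carlo_return(policy, n_rollouts=2000):
    """Estimate the policy's expected return by averaging independent rollouts."""
    return sum(simulate_episode(policy) for _ in range(n_rollouts)) / n_rollouts

# Two candidate formulas: a constant preference and a state-dependent one.
constant = make_formula_policy(lambda x, a: -a)                   # always picks action 0
adaptive = make_formula_policy(lambda x, a: x if a == 0 else 1.0 - x)

print(monte_carlo_return(constant))   # close to 0.5
print(monte_carlo_return(adaptive))   # close to 0.75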

Cited by 23 publications (17 citation statements)
References 22 publications (19 reference statements)
“…Note that such exact simulations are usually not available in industry. Similarly, in (Maes et al., 2012) Monte Carlo simulations were drawn to identify the best policies. However, the policy search itself was performed by formalizing a search over a space of simple closed-form formulas as a multi-armed bandit problem.…”
Section: Related Work
confidence: 99%
“…This work introduces a genetic programming (GP) approach for autonomously learning interpretable reinforcement learning (RL) policies from previously recorded state transitions. Despite the search for interpretable RL policies being of high academic and industrial interest, little has been published concerning human-interpretable and understandable policies trained by data-driven learning methods (Maes, Fonteneau, Wehenkel, and Ernst, 2012). Recent research results show that using fuzzy rules in batch RL settings can be considered an adequate solution to this task (Hein, Hentschel, Runkler, and Udluft, 2017b).…”
Section: Introduction
confidence: 99%
“…In order to approximately solve (4), we adopt the formalism of multiarmed bandits and proceed in two steps: first, we construct a finite set of candidate algorithms (Section IV-A), and then treat each of these algorithms as an arm and use a multiarmed bandit policy to select how to allocate computational time to the performance estimation of the different algorithms (Section IV-B). It is worth mentioning that this two-step approach follows a general methodology for automatic discovery that we already successfully applied to multiarmed bandit policy discovery [12], [13], reinforcement learning policy discovery [14], and optimal control policy discovery [15].…”
Section: Bandit-based Algorithm Discovery
confidence: 99%
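The second step described in this excerpt, spreading a simulation budget over a finite set of candidate algorithms with a multiarmed bandit policy, could be sketched roughly as follows. UCB1 is used here as one standard bandit index; the candidate list and the evaluate stub are hypothetical placeholders, not the paper's actual benchmark.

import math
import random

# Rough sketch of the allocation step: treat each candidate algorithm as an arm
# and let a bandit policy (UCB1 here, one common choice) decide which candidate
# receives the next noisy performance evaluation.

def evaluate(candidate):
    """One noisy performance measurement in [0, 1] (stand-in for a simulation)."""
    return min(1.0, max(0.0, random.gauss(candidate["true_score"], 0.2)))

def ucb1_allocate(candidates, budget):
    counts = [0] * len(candidates)
    totals = [0.0] * len(candidates)
    for t in range(1, budget + 1):
        if 0 in counts:                      # make sure every arm is tried once
            arm = counts.index(0)
        else:                                # otherwise pick the arm with the largest UCB1 index
            arm = max(range(len(candidates)),
                      key=lambda k: totals[k] / counts[k]
                      + math.sqrt(2.0 * math.log(t) / counts[k]))
        totals[arm] += evaluate(candidates[arm])
        counts[arm] += 1
    best = max(range(len(candidates)), key=lambda k: totals[k] / counts[k])
    return candidates[best], counts

candidates = [{"name": f"algo_{i}", "true_score": s}
              for i, s in enumerate([0.4, 0.55, 0.7])]
print(ucb1_allocate(candidates, budget=2000))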
“…One simple approach to approximately solve (4) is to estimate the objective function through an empirical mean computed over a finite set of training problems drawn from the problem distribution (their Eq. (15)), where each term denotes one outcome of the algorithm, run with the given budget, on one training problem. To solve (4), one can then compute this approximated objective function for all algorithms and simply return the algorithm with the highest score.…”
Section: B. Bandit-based Algorithm Discovery
confidence: 99%
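The uniform-allocation baseline described in this excerpt, scoring every candidate by the empirical mean of its outcomes on a fixed set of training problems and returning the argmax, could look roughly like this; run_algorithm, the problem set, and the scores are hypothetical stand-ins, not the paper's code.

import random

# Stand-in for the empirical-mean baseline: score every candidate algorithm by
# the average of its outcomes on a fixed set of training problems, then return
# the highest-scoring candidate.

def run_algorithm(algorithm, problem):
    """One noisy outcome of `algorithm` on `problem` (stands in for a simulation)."""
    return random.gauss(algorithm["true_score"] - problem["difficulty"], 0.1)

def best_by_empirical_mean(algorithms, training_problems):
    def score(algorithm):
        outcomes = [run_algorithm(algorithm, p) for p in training_problems]
        return sum(outcomes) / len(outcomes)
    return max(algorithms, key=score)

problems = [{"difficulty": random.uniform(0.0, 0.3)} for _ in range(50)]
algorithms = [{"name": "algo_a", "true_score": 0.5},
              {"name": "algo_b", "true_score": 0.65}]
print(best_by_empirical_mean(algorithms, problems)["name"])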
“…Learning such RL controllers in a way that produces interpretable high-level controllers is the scope of this paper and the proposed approach. Especially for real-world industry problems, this is of high interest, since interpretable RL policies are expected to yield higher acceptance from domain experts than black-box solutions (Maes, Fonteneau, Wehenkel, and Ernst, 2012).…”
Section: Introduction
confidence: 99%