2021
DOI: 10.48550/arxiv.2112.10264
Preprint
Exploration-exploitation trade-off for continuous-time episodic reinforcement learning with linear-convex models

Abstract: We develop a probabilistic framework for analysing model-based reinforcement learning in the episodic setting. We then apply it to study finite-time horizon stochastic control problems with linear dynamics but unknown coefficients and convex, but possibly irregular, objective function. Using probabilistic representations, we study regularity of the associated cost functions and establish precise estimates for the performance gap between applying optimal feedback control derived from estimated and true model pa…
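The setting sketched in the abstract (episodic learning of linear dynamics with unknown coefficients, then applying feedback control derived from the estimated model) can be illustrated with a minimal discrete-time scalar example. This is a hedged sketch, not the paper's algorithm: the scalar model, the exploration noise, the least-squares estimator, and the certainty-equivalent LQR gains below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed scalar linear dynamics with unknown coefficients (a_true, b_true):
#   x_{t+1} = a*x_t + b*u_t + noise
a_true, b_true, noise_std = 0.9, 0.5, 0.1
horizon, episodes = 20, 50


def riccati_gains(a, b, q=1.0, r=1.0, horizon=20):
    """Finite-horizon LQR feedback gains via the backward Riccati recursion
    for the (estimated) scalar model x' = a*x + b*u with cost q*x^2 + r*u^2."""
    p = q
    gains = []
    for _ in range(horizon):
        k = (a * p * b) / (r + b * p * b)          # certainty-equivalent gain
        p = q + a * p * a - (a * p * b) ** 2 / (r + b * p * b)
        gains.append(k)
    return gains[::-1]                              # gains[t] applies at time t


# Episodic loop: act with the current certainty-equivalent policy plus
# exploration noise, then re-estimate (a, b) by least squares on all data.
xs, us, xns = [], [], []
a_hat, b_hat = 0.0, 0.0                             # initial model guess
for ep in range(episodes):
    gains = riccati_gains(a_hat, b_hat, horizon=horizon)
    x = 1.0
    for t in range(horizon):
        u = -gains[t] * x + rng.normal(0.0, 0.5)    # exploration noise
        xn = a_true * x + b_true * u + rng.normal(0.0, noise_std)
        xs.append(x); us.append(u); xns.append(xn)
        x = xn
    # Regress x_{t+1} on (x_t, u_t) to re-estimate the coefficients.
    Z = np.column_stack([xs, us])
    (a_hat, b_hat), *_ = np.linalg.lstsq(Z, np.array(xns), rcond=None)

print(a_hat, b_hat)  # estimates approach (a_true, b_true) as data accumulates
```

The exploration noise in the control is what makes `b` identifiable; without it, early episodes (where the estimated gain is zero) would provide no excitation in `u`. The paper's contribution concerns the continuous-time, linear-convex analogue of this trade-off, including regret bounds that this toy loop does not attempt to reproduce.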

Cited by 3 publications (11 citation statements)
References 19 publications
“…Our main results which were outlined above significantly extend the work on reinforcement learning for continuous-time parametric models which were studied by [4,27,42,17,18,43] among others. We outline our main contributions that correspond to each part of the learning algorithm.…”
Section: Introduction (supporting)
confidence: 75%
“…The main ingredient in the exploration phase is to estimate the kernel function G in (1.1) in a non-parametric manner. This stands in contrast to existing theoretical works on parametric model-based reinforcement learning problems [4,27,42,17,18,43]. Moreover, in our setting G is the driver of the transient price impact, which is an unobserved state variable of the control problem.…”
Section: Non-parametric Kernel Estimation (mentioning)
confidence: 84%
“…By exploiting the boundedness of ∂ x f and ∂ x g, we establish an a-priori bound on the adjoint processes, and subsequently prove that the iterative scheme (1.7) generates Lipschitz continuous policies (φ m ) m∈N 0 (see Proposition 3.7). If the drift and diffusion coefficients are affine in x, then (2.1) and (2.5) can be relaxed to quadratically growing functions, which include as special cases the linear-convex control problems studied in [12,36].…”
Section: Standing Assumptions and Main Results (mentioning)
confidence: 99%