Iterative reward shaping for non-overshooting altitude control of a wing-in-ground craft based on deep reinforcement learning

Hu, Huan; Zhang, Guiyong; Ding, Lichao; Jiao, Kuikui; Zhang, Zhifan; Zhang, Ji

doi:10.1016/j.robot.2023.104383

Cited by 2 publications

(1 citation statement)

References 20 publications

“…Based on zeroth-order optimization techniques, it uses multiple system trajectories to estimate the policy gradient. There has been a resurgent interest in studying theoretical properties of PO on the LQR problem such as convergence and sample complexity; see e.g., [4]-[7] and the comprehensive survey [8]. Even though global convergence has been shown for the nonconvex PO … Research of Feiran Zhao and Keyou You was supported by National Natural Science Foundation of China under Grant no.…”

Section: Introduction (mentioning)

confidence: 99%

Infinite-horizon Risk-constrained Linear Quadratic Regulator with Average Cost

Zhao; You; Başar, 2021

2021 60th IEEE Conference on Decision and Control (CDC)

Policy optimization (PO), an essential approach of reinforcement learning for a broad range of system classes, requires significantly more system data than indirect (identification-followed-by-control) methods or behavioral-based direct methods, even in the simplest linear quadratic regulator (LQR) problem. In this paper, we take an initial step towards bridging this gap by proposing the data-enabled policy optimization (DeePO) method, which requires only a finite number of sufficiently exciting data to iteratively solve the LQR via PO. Based on a data-driven closed-loop parameterization, we are able to directly compute the policy gradient from a batch of persistently exciting data. Next, we show that the nonconvex PO problem satisfies a projected gradient dominance property by relating it to an equivalent convex program, leading to the global convergence of DeePO. Moreover, we apply regularization methods to enhance certainty-equivalence and robustness of the resulting controller and show an implicit regularization property. Finally, we perform simulations to validate our results.
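The abstract above relies on a batch of "sufficiently exciting" data without stating the condition. As a rough illustration only, the sketch below uses a rank test that is common in the data-driven control literature: the stacked input/state matrix should have full row rank n + m. Whether this is exactly the condition used in the paper is an assumption. The system matrices, the helper names (`collect_data`, `is_sufficiently_exciting`, `estimate_dynamics`), and the least-squares identification step are hypothetical and not taken from the source; identification is the indirect route that the abstract contrasts with DeePO, and appears here only to show why the rank condition matters.

```python
import numpy as np


def collect_data(A, B, T, rng):
    """Simulate x[t+1] = A x[t] + B u[t] for T steps with random inputs.

    Returns X0 (states at t = 0..T-1), U0 (inputs), X1 (states at t = 1..T).
    """
    n, m = B.shape
    X = np.zeros((n, T + 1))
    U = rng.standard_normal((m, T))
    X[:, 0] = rng.standard_normal(n)
    for t in range(T):
        X[:, t + 1] = A @ X[:, t] + B @ U[:, t]
    return X[:, :-1], U, X[:, 1:]


def is_sufficiently_exciting(X0, U0):
    """Assumed condition: the stacked matrix [U0; X0] has full row rank n + m."""
    D0 = np.vstack([U0, X0])
    return np.linalg.matrix_rank(D0) == D0.shape[0]


def estimate_dynamics(X0, U0, X1):
    """Least-squares fit of [B A] from X1 = [B A] [U0; X0] (noise-free data)."""
    D0 = np.vstack([U0, X0])
    BA = X1 @ np.linalg.pinv(D0)
    m = U0.shape[0]
    return BA[:, m:], BA[:, :m]  # A_hat, B_hat


# Hypothetical double-integrator example (not from the source).
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])

X0, U0, X1 = collect_data(A, B, T=10, rng=rng)
exciting = is_sufficiently_exciting(X0, U0)
A_hat, B_hat = estimate_dynamics(X0, U0, X1)

# With an all-zero input sequence the stacked matrix loses rank.
not_exciting = is_sufficiently_exciting(X0, np.zeros_like(U0))
```

When the rank test passes and the data are noise-free, the fitted matrices match the true dynamics up to numerical precision; when it fails, the same batch does not determine the system uniquely, which is why a finite batch is only useful once it is rich enough.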
