2020
DOI: 10.1016/j.asr.2019.12.030

Deep reinforcement learning for six degree-of-freedom planetary landing

Cited by 154 publications (84 citation statements)
References 38 publications
“…Specifically, the trained policy's hidden state captures unobserved (potentially time-varying) information such as external forces that are useful in minimizing the cost function. In contrast, a non-recurrent policy (which we will refer to as an MLP policy), which does not maintain a persistent hidden state vector, can only optimize using a set of current observations, actions, and advantages, and will tend to under-perform a recurrent policy on tasks with randomized dynamics, although as we have shown in [19], training with parameter uncertainty can give good results using an MLP policy, provided the parameter uncertainty is not too extreme. After training, although the recurrent policy's network weights are frozen, the hidden state will continue to evolve in response to a sequence of observations and actions, thus making the policy adaptive.…”
Section: Robustness
confidence: 99%
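The excerpt above contrasts a recurrent policy, whose persistent hidden state can infer unobserved and time-varying effects, with a memoryless MLP policy. The sketch below is only a rough illustration of that mechanism, not the cited implementation: the layer sizes, weight initialization, and function names are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, HID_DIM, ACT_DIM = 12, 32, 4  # illustrative sizes, not from the paper

# MLP policy: the action depends only on the current observation.
W1 = rng.standard_normal((HID_DIM, OBS_DIM)) * 0.1
W2 = rng.standard_normal((ACT_DIM, HID_DIM)) * 0.1

def mlp_policy(obs):
    return np.tanh(W2 @ np.tanh(W1 @ obs))

# Recurrent (GRU-style) policy: a persistent hidden state h summarizes the
# observation/action history, so unmodeled, time-varying effects (e.g. external
# forces) can influence the action even though the weights are frozen.
Wz = rng.standard_normal((HID_DIM, OBS_DIM + HID_DIM)) * 0.1
Wr = rng.standard_normal((HID_DIM, OBS_DIM + HID_DIM)) * 0.1
Wh = rng.standard_normal((HID_DIM, OBS_DIM + HID_DIM)) * 0.1
Wo = rng.standard_normal((ACT_DIM, HID_DIM)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recurrent_policy(obs, h):
    xh = np.concatenate([obs, h])
    z = sigmoid(Wz @ xh)                      # update gate
    r = sigmoid(Wr @ xh)                      # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([obs, r * h]))
    h_new = (1.0 - z) * h + z * h_tilde       # hidden state keeps evolving
    return np.tanh(Wo @ h_new), h_new

h = np.zeros(HID_DIM)
for t in range(5):
    obs = rng.standard_normal(OBS_DIM)        # stand-in for the estimated state
    a_mlp = mlp_policy(obs)                   # no memory of previous steps
    a_rnn, h = recurrent_policy(obs, h)       # h adapts with frozen weights
```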
“…Reinforcement Learning (RL) has recently been successfully applied to landing guidance problems. [9][10][11][12] Importantly, the observations are chosen such that the policy generalizes well to different landing sites. Specifically, the policy can be optimized for a specific landing site, and when deployed can be used for an arbitrary landing site.…”
Section: Initial Conditions
confidence: 99%
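The excerpt above notes that the observations are chosen so the policy generalizes to arbitrary landing sites. One common way to achieve this is to express the lander's translational state in a landing-site-relative frame; the minimal sketch below illustrates that idea only, with frame and variable names that are assumptions rather than definitions taken from the cited papers.

```python
import numpy as np

def target_relative_observation(r_inertial, v_inertial, r_target,
                                C_target_from_inertial):
    """Express position/velocity relative to an arbitrary landing site.

    r_inertial, v_inertial : lander position/velocity in a planet-fixed frame
    r_target               : landing-site position in the same frame
    C_target_from_inertial : rotation into a landing-site-centered frame
    (All names are illustrative; the cited papers define their own frames.)
    """
    r_rel = C_target_from_inertial @ (np.asarray(r_inertial) - np.asarray(r_target))
    v_rel = C_target_from_inertial @ np.asarray(v_inertial)
    return np.concatenate([r_rel, v_rel])
```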
“…In our previous work with the 6-DOF Mars powered descent phase, the policy took less than 1 ms to run the mapping between estimated state and thruster commands (four small matrix multiplications) on a 3 GHz processor. 12 Since in this work the mapping is updated every six seconds, we do not see any issues with running this on the current generation of space-certified flight computers. A diagram illustrating how the policy interfaces with peripheral spacecraft components is shown in Fig.…”
Section: Initial Conditions
confidence: 99%
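The quoted statement characterizes the trained policy's forward pass as a few small matrix multiplications executing in well under a millisecond. The sketch below is one way to see why such a claim is plausible; the layer sizes and timing loop are assumptions for illustration, not the network architecture used in the cited work.

```python
import time
import numpy as np

rng = np.random.default_rng(1)
# "Four small matrix multiplications" corresponds to a network with four
# weight matrices; these layer sizes are illustrative assumptions.
sizes = [12, 64, 64, 64, 4]   # estimated state in, thruster commands out
weights = [rng.standard_normal((n_out, n_in)) * 0.1
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def policy(x):
    for W in weights[:-1]:
        x = np.tanh(W @ x)
    return np.tanh(weights[-1] @ x)

x = rng.standard_normal(sizes[0])
policy(x)                                      # warm-up
t0 = time.perf_counter()
N = 1000
for _ in range(N):
    policy(x)
elapsed_ms = 1000.0 * (time.perf_counter() - t0) / N
print(f"mean forward pass: {elapsed_ms:.3f} ms")  # typically far below 1 ms
```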
“…In this proposal, we use a different approach [11] based on a recurrent policy and value function. Note that it is possible to train over a wide range of POMDPs using a non-meta RL algorithm [18]. Although such an approach typically results in a robust policy, the policy cannot adapt in real time to novel environments.…”
Section: RL Overview
confidence: 99%