Undiscounted return is an important setting in reinforcement learning (RL) and characterizes many real-world problems. However, optimizing an undiscounted return often causes training instability, and the causes of this instability have not been analyzed in depth by existing studies. In this article, the problem is analyzed from the perspective of value estimation. The analysis indicates that the instability originates from transient traps caused by inconsistently selected actions. However, always selecting one consistent action in the same state limits exploration. To balance exploration effectiveness and training stability, a novel sampling method called last-visit sampling (LVS) is proposed to ensure that a subset of actions is selected consistently in the same state. The LVS method decomposes the state-action value into two parts, i.e., the last-visit (LV) value and the revisit value, so that the LV value is determined only by consistently selected actions. We prove that the LVS method eliminates transient traps while preserving optimality, and we show empirically that it stabilizes training on five typical tasks, including vision-based navigation and manipulation tasks.
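The value decomposition and the consistent-action sampling can be pictured with a small tabular sketch. Everything below, including the LVSAgent class, the lv_value/revisit_value tables, and the epsilon-style exploration rule, is an illustrative assumption based only on the abstract, not the authors' actual algorithm:

```python
# Minimal sketch of the last-visit sampling (LVS) idea in a tabular setting.
# Names and the exploration rule are illustrative assumptions, not the paper's implementation.
import random
from collections import defaultdict

class LVSAgent:
    def __init__(self, actions, epsilon=0.1):
        self.actions = actions
        self.epsilon = epsilon
        self.last_action = {}                    # state -> action chosen on the last visit
        self.lv_value = defaultdict(float)       # (state, action) -> last-visit part of Q
        self.revisit_value = defaultdict(float)  # (state, action) -> revisit part of Q

    def q(self, state, action):
        # The state-action value is the sum of the two decomposed parts, so the
        # LV part can be learned from consistently selected actions only.
        return self.lv_value[(state, action)] + self.revisit_value[(state, action)]

    def select_action(self, state):
        # Repeat the action from the last visit so a subset of actions stays
        # consistent; fall back to a greedy choice only occasionally (exploration).
        if state in self.last_action and random.random() > self.epsilon:
            action = self.last_action[state]
        else:
            action = max(self.actions, key=lambda a: self.q(state, a))
        self.last_action[state] = action
        return action
```

In this reading, repeating the last-visit action keeps the LV part of the value estimate consistent across visits, which is what the abstract credits with removing transient traps, while the occasional non-repeated choice preserves exploration.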
Incorporating sequence-to-sequence models into history-based reinforcement learning (RL) provides a general way to extend RL to partially observable tasks. This approach compresses the history space according to the correlations between historical observations and rewards. However, it does not adjust for the confounding correlations introduced by data sampling, and it assigns high belief to uninformative historical observations, which limits how much the history space can be compressed. Counterfactual inference (CI), which estimates causal effects through single-variable interventions, is a promising way to adjust for confounding. However, directly applying single-variable interventions to a huge number of historical observations is computationally infeasible. This paper proposes performing CI on observation sub-spaces instead of single observations and develops a coarse-to-fine CI algorithm, called Tree-based History Counterfactual Inference (T-HCI), that reduces the number of interventions exponentially. We show that T-HCI is computationally feasible in practice and brings significant gains in sample efficiency on various challenging partially observable tasks, including Maze, BabyAI, and robot manipulation tasks.
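The coarse-to-fine idea can be sketched as a recursive binary split over the history: intervene on a whole sub-space of observations first, and refine a sub-space only when its intervention visibly changes the prediction. The masking intervention, the causal_effect estimator, and the threshold below are illustrative assumptions, not the paper's exact procedure:

```python
# Coarse-to-fine counterfactual inference over a history, in the spirit of T-HCI.
# Masking observations with None stands in for the intervention; the real effect
# estimator and thresholds may differ from this sketch.
def causal_effect(history, masked_indices, predict_return):
    """Effect of intervening on (masking) a sub-space of historical observations."""
    masked = [None if i in masked_indices else o for i, o in enumerate(history)]
    return abs(predict_return(history) - predict_return(masked))

def thci_informative_steps(history, predict_return, lo=0, hi=None, threshold=1e-3):
    """Return indices of observations whose intervention changes the prediction.

    Rather than intervening on every observation, intervene on a whole sub-space
    first; only when the coarse effect exceeds the threshold is the sub-space
    split and inspected at a finer level, pruning uninformative sub-trees.
    """
    if hi is None:
        hi = len(history)
    if lo >= hi:
        return []
    if causal_effect(history, set(range(lo, hi)), predict_return) < threshold:
        return []                      # whole sub-space uninformative: prune it
    if hi - lo == 1:
        return [lo]                    # a single informative observation
    mid = (lo + hi) // 2
    return (thci_informative_steps(history, predict_return, lo, mid, threshold)
            + thci_informative_steps(history, predict_return, mid, hi, threshold))
```

Under this reading, if only a few observations in a long history are informative, most sub-trees are pruned after a single coarse intervention, which is how the number of interventions can drop from linear in the history length to roughly logarithmic per informative observation.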