Yangchen Pan scite author profile

Yangchen Pan

5Publications

60Citation Statements Received

47Citation Statements Given

How they've been cited

How they cite others

Affiliations

University of Alberta, Indiana University

Publications

Order By: Most citations

Hill Climbing on Value Estimates for Search-control in Dyna

Pan

Yao

Farahmand

et al. 2019

View full text Add to dashboard Cite

Dyna is an architecture for model-based reinforcement learning (RL), where simulated experience from a model is used to update policies or value functions. A key component of Dyna is search-control, the mechanism to generate the state and action from which the agent queries the model, which remains largely unexplored. In this work, we propose to generate such states by using the trajectory obtained from Hill Climbing (HC) the current estimate of the value function. This has the effect of propagating value from high-value regions and of preemptively updating value estimates of the regions that the agent is likely to visit next. We derive a noisy projected natural gradient algorithm for hill climbing, and highlight a connection to Langevin dynamics. We provide an empirical demonstration on four classical domains that our algorithm, HC-Dyna, can obtain significant sample efficiency improvements. We study the properties of different sampling distributions for search-control, and find that there appears to be a benefit specifically from using the samples generated by climbing on current value estimates from low-value to high-value region. 1 We use DQN to refer to the algorithm by [Mnih et al., 2015] that uses ER and target network, but not the exact original architecture.

show abstract

Organizing Experience: a Deeper Look at Replay Mechanisms for Sample-Based Planning in Continuous State Domains

Pan

Zaheer

White

et al. 2018

View full text Add to dashboard Cite

Model-based strategies for control are critical to obtain sample efficient learning. Dyna is a planning paradigm that naturally interleaves learning and planning, by simulating one-step experience to update the action-value function. This elegant planning strategy has been mostly explored in the tabular setting. The aim of this paper is to revisit sample-based planning, in stochastic and continuous domains with learned models. We first highlight the flexibility afforded by a model over Experience Replay (ER). Replay-based methods can be seen as stochastic planning methods that repeatedly sample from a buffer of recent agent-environment interactions and perform updates to improve data efficiency. We show that a model, as opposed to a replay buffer, is particularly useful for specifying which states to sample from during planning, such as predecessor states that propagate information in reverse from a state more quickly. We introduce a semi-parametric model learning approach, called Reweighted Experience Models (REMs), that makes it simple to sample next states or predecessors. We demonstrate that REM-Dyna exhibits similar advantages over replay-based methods in learning in continuous state problems, and that the performance gap grows when moving to stochastic domains, of increasing size.

show abstract

Organizing Experience: A Deeper Look at Replay Mechanisms for Sample-based Planning in Continuous State Domains

Pan¹,

Zaheer²,

White³

et al. 2018

Preprint

View full text Add to dashboard Cite

Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy Improvement

Neumann¹,

Lim²,

Joseph³

et al. 2018

Preprint

View full text Add to dashboard Cite

Q-learning can be difficult to use in continuous action spaces, because an optimization has to be solved to find the maximal action for the action-values. A common strategy has been to restrict the functional form of the action-values to be concave in the actions, to simplify the optimization. Such restrictions, however, can prevent learning accurate action-values. In this work, we propose a new policy search objective that facilitates using Q-learning and a framework to optimize this objective, called Actor-Expert. The Expert uses Qlearning to update the action-values towards optimal action-values. The Actor learns the maximal actions over time for these changing action-values. We develop a Cross Entropy Method (CEM) for the Actor, where such a global optimization approach facilitates use of generically parameterized action-values. This method-which we call Conditional CEMiteratively concentrates density around maximal actions, conditioned on state. We prove that this algorithm tracks the expected CEM update, over states with changing actionvalues. We demonstrate in a toy environment that previous methods that restrict the action-value parameterization fail whereas Actor-Expert with a more general action-value parameterization succeeds. Finally, we demonstrate that Actor-Expert performs as well as or better than competitors on four benchmark continuous-action environments.

show abstract

Accelerated Gradient Temporal Difference Learning

Pan

White

2016

Preprint

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Yangchen Pan

Hill Climbing on Value Estimates for Search-control in Dyna

Organizing Experience: a Deeper Look at Replay Mechanisms for Sample-Based Planning in Continuous State Domains

Organizing Experience: A Deeper Look at Replay Mechanisms for Sample-based Planning in Continuous State Domains

Greedy Actor-Critic: A New Conditional Cross-Entropy Method for Policy Improvement

Accelerated Gradient Temporal Difference Learning

Contact Info

Product

Resources

About