2020
DOI: 10.48550/arxiv.2010.09536
Preprint

What About Inputting Policy in Value Function: Policy Representation and Policy-extended Value Function Approximator

Abstract: The value function lies at the heart of Reinforcement Learning (RL): it defines the long-term evaluation of a policy in a given state. In this paper, we propose the Policy-extended Value Function Approximator (PeVFA), which extends the conventional value function to be a function not only of the state but also of an explicit policy representation. Such an extension enables a PeVFA to preserve the values of multiple policies, in contrast to a conventional approximator with capacity for only one policy, inducing the new characteristic of…
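To make the core idea concrete, here is a minimal sketch (not the paper's implementation) of a PeVFA as a feed-forward network that takes a state and a policy representation vector as joint input; all layer sizes and names below are illustrative assumptions.

```python
# Hedged sketch of a Policy-extended Value Function Approximator (PeVFA):
# a value network conditioned on both the state and a policy representation.
# Layer sizes and names are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class PeVFA(nn.Module):
    """Value function V(s, chi_pi) over states and policy representations."""

    def __init__(self, state_dim: int, policy_repr_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + policy_repr_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # scalar value estimate
        )

    def forward(self, state: torch.Tensor, policy_repr: torch.Tensor) -> torch.Tensor:
        # Concatenating the policy representation lets a single network
        # preserve the values of many policies at once.
        return self.net(torch.cat([state, policy_repr], dim=-1))

# Usage: a batch of 32 states (dim 8) and policy representations (dim 64).
v = PeVFA(state_dim=8, policy_repr_dim=64)
values = v(torch.randn(32, 8), torch.randn(32, 64))  # shape (32, 1)
```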

Cited by 3 publications (13 citation statements)
References 9 publications
“…Harb et al. (2020) use the actions that the policy samples in probing states as policy representations. Some other articles (Faccio et al., 2021; Tang et al., 2020) use the policy network parameters themselves as policy representations.…”
Section: E2 Representation Learning
Citation type: mentioning; confidence: 99%
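As a rough illustration of the probing-state approach attributed to Harb et al. (2020), the following sketch concatenates a policy's actions at a fixed set of probing states into one representation vector; the function name and the assumption of a deterministic, batch-callable policy are mine, not from the cited work.

```python
# Hedged sketch of probing-state policy representations: a policy is
# represented by the actions it produces at a fixed set of probing states.
import torch

def probing_state_representation(policy, probing_states: torch.Tensor) -> torch.Tensor:
    """Flatten the policy's actions at fixed probing states into one vector.

    policy: a callable mapping a batch of states to a batch of actions
        (assumed deterministic for this illustration).
    probing_states: tensor of shape (num_probes, state_dim), held fixed
        across all policies so their representations are comparable.
    """
    with torch.no_grad():
        actions = policy(probing_states)  # (num_probes, action_dim)
    return actions.flatten()              # (num_probes * action_dim,)
```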
“…Recent work (Tang et al., 2020) learned Parameter-Based State-Value Functions which, coupled with PPO, improved performance. The authors did not use the value function to backpropagate gradients directly through the policy parameters, but only exploited the general policy-evaluation properties of the method.…”
Section: Related Work
Citation type: mentioning; confidence: 99%
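The parameter-as-input approach this quote describes might look like the following sketch, where the value network consumes the flattened policy weights as a plain feature vector; shapes, names, and the flattening scheme are illustrative assumptions.

```python
# Hedged sketch of a parameter-based state-value function: the value network
# takes the flattened policy parameters as an input feature, alongside the state.
import torch
import torch.nn as nn

def flatten_params(policy: nn.Module) -> torch.Tensor:
    # Stack all policy weights into a single fixed-length vector.
    return torch.cat([p.detach().flatten() for p in policy.parameters()])

class ParamBasedValue(nn.Module):
    def __init__(self, state_dim: int, param_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + param_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state: torch.Tensor, flat_params: torch.Tensor) -> torch.Tensor:
        # As in the citation, no gradient flows back through the policy
        # parameters here; they are detached and treated as plain inputs.
        batch = flat_params.expand(state.shape[0], -1)
        return self.net(torch.cat([state, batch], dim=-1))
```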
“…Intuitively and naturally, such issues can be significantly alleviated if we have an ideal surrogate policy space that is compact in scale while keeping the key features of the original policy space. In this direction, low-dimensional latent representations of policies play an important role in Reinforcement Learning (RL) [34], Opponent Modeling [8], Fast Adaptation [25, 27], Behavioral Characterization [14], etc. In these domains, a few preliminary attempts have been made at devising different policy representations.…”
Section: Introduction
Citation type: mentioning; confidence: 99%
“…Rather than the policy distribution, some other works resort to information about the policy's influence on the environment, e.g., the state(-action) visitation distribution induced by the policy [14, 20]. Recently, Tang et al. [34] offered several methods to learn policy representations through policy contrast or recovery, from both policy network parameters and interaction experiences. Put shortly, the key question of policy representation learning is by what criterion we should abstract the policy space for the desired compression and generalization.…”
Section: Introduction
Citation type: mentioning; confidence: 99%
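The "recovery" criterion mentioned for Tang et al. [34] can be pictured as an autoencoder over flattened policy parameters, as in this hedged sketch; the architecture and loss are illustrative assumptions, not the paper's exact method.

```python
# Hedged sketch of recovery-based policy representation learning: compress
# flattened policy parameters into a low-dimensional code and train it to
# reconstruct ("recover") the original parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyReprAutoencoder(nn.Module):
    def __init__(self, param_dim: int, repr_dim: int = 64):
        super().__init__()
        self.encoder = nn.Linear(param_dim, repr_dim)  # compress the policy
        self.decoder = nn.Linear(repr_dim, param_dim)  # recover it back

    def forward(self, flat_params: torch.Tensor):
        z = self.encoder(flat_params)
        recovery_loss = F.mse_loss(self.decoder(z), flat_params)
        # Minimizing the recovery loss forces z to compress the raw
        # parameter space while keeping enough information to rebuild it.
        return z, recovery_loss
```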