2020
DOI: 10.48550/arxiv.2012.01244
Preprint

General Characterization of Agents by States they Visit

Cited by 1 publication (11 citation statements)
References 0 publications
“…To compare these policy abstractions quantitatively, we demonstrate how the distances between two policies, measured by the corresponding policy metrics, differ in several Gridworld MDPs. We borrow Distinct Policies and Doorway from [14] and design a new environment, Key Action, as simple prototypes of environments with different features; moreover, we increase the stochasticity of the environment for a better evaluation, as done in [14]. In particular, E_{s∼p(s)}[D(·,·)] is calculated by averaging the absolute differences over all states.…”
Section: Empirical Comparison of Policy Metrics in Gridworld MDPs (mentioning)
confidence: 99%
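The averaging step quoted above is simple to reproduce. Below is a minimal sketch assuming tabular policies stored as state-by-action probability matrices; the function name expected_policy_distance and the choice of per-state distance (mean absolute difference of action probabilities) are illustrative assumptions, not taken from the cited paper.

```python
import numpy as np

def expected_policy_distance(pi1, pi2, state_probs):
    """Estimate E_{s ~ p(s)}[D(pi1, pi2)] for tabular policies.

    pi1, pi2: arrays of shape (n_states, n_actions), rows are action
    probabilities. state_probs: shape (n_states,), the distribution p(s).
    D is taken here as the mean absolute difference of action
    probabilities per state (an assumption, not the paper's exact D).
    """
    per_state = np.abs(pi1 - pi2).mean(axis=1)    # D(pi1(s), pi2(s)) for each s
    return float(np.dot(state_probs, per_state))  # weight by p(s) and sum

# Example: two policies over 3 states and 2 actions, uniform p(s).
pi_a = np.array([[1.0, 0.0], [0.5, 0.5], [0.2, 0.8]])
pi_b = np.array([[0.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
p_s = np.full(3, 1 / 3)
print(expected_policy_distance(pi_a, pi_b, p_s))  # 0.333...: one of three states fully differs
```

With a uniform p(s), this reduces to the plain average of the per-state absolute differences over all states, which matches the computation described in the quote.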
“…3.2 in policy optimization below. To be specific, we consider two policy optimization problem settings: Trust-Region Policy Optimization (TRPO) and Diversity-Guided Evolutionary Strategy (DGES), as introduced in [14], covering both gradient-based and gradient-free policy optimization. Complete details of the problem settings are provided in Appendix E.…”
Section: Applying Policy Abstraction to Policy Optimization (mentioning)
confidence: 99%
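To make the gradient-free setting concrete, here is a minimal NES-style sketch of a diversity-guided evolutionary step, assuming the policy metric enters as an additive novelty bonus against an archive of past policies. The actual DGES formulation in [14] may differ; fitness_fn, policy_dist_fn, and archive are hypothetical placeholders.

```python
import numpy as np

def dges_step(theta, fitness_fn, policy_dist_fn, archive,
              sigma=0.1, alpha=0.01, pop_size=50, beta=0.5, rng=None):
    """One diversity-guided evolution-strategy update (illustrative only).

    Each perturbed parameter vector is scored by task fitness plus a
    diversity bonus: its mean policy distance to an archive of earlier
    policies, measured with the supplied policy metric policy_dist_fn.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((pop_size, theta.size))
    scores = np.empty(pop_size)
    for i in range(pop_size):
        cand = theta + sigma * eps[i]
        novelty = (np.mean([policy_dist_fn(cand, old) for old in archive])
                   if archive else 0.0)
        scores[i] = fitness_fn(cand) + beta * novelty
    # Standardize scores, then take the usual ES gradient estimate.
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    grad = eps.T @ scores / (pop_size * sigma)
    return theta + alpha * grad
```

The design point this illustrates: a policy metric gives the optimizer a behavior-level notion of novelty, so the diversity pressure acts on what policies do rather than on raw parameter distance.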