Abstract: We study the problem of identifying the policy space available to an agent in a learning process, having access to a set of demonstrations generated by the agent playing the optimal policy in the considered space. We introduce an approach based on frequentist statistical testing to identify the set of policy parameters that the agent can control, within a larger parametric policy space. After presenting two identification rules (combinatorial and simplified), applicable under different assumptions on the policy space, …
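The truncated abstract does not show the test itself. As a minimal sketch, assuming the identification rules reduce to per-parameter likelihood-ratio tests on the demonstrations (the function below and its arguments are hypothetical illustrations, not the paper's implementation):

```python
from scipy import stats

def parameter_is_controlled(loglik_full, loglik_restricted, alpha=0.05):
    """Hypothetical per-parameter identification test (illustrative only).

    loglik_full       -- maximized log-likelihood of the demonstrations
                         under the full parametric policy space
    loglik_restricted -- maximized log-likelihood with the tested parameter
                         clamped to its default (uncontrolled) value
    alpha             -- significance level of the frequentist test
    """
    # Wilks' theorem: under the null hypothesis (parameter not controlled),
    # the statistic 2*(ll_full - ll_restricted) is asymptotically chi^2(1).
    lr_stat = 2.0 * (loglik_full - loglik_restricted)
    p_value = stats.chi2.sf(lr_stat, df=1)
    return p_value < alpha  # rejecting the null suggests the parameter is controlled
```

Under such a reading, the simplified rule would test each parameter in isolation, while the combinatorial rule would compare subsets of parameters jointly.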
“…We also demonstrate in our experiments that our sample dropout technique can boost the sample efficiency of GAE-based policy optimization algorithms. [28] and [29] propose an actor-only policy optimization algorithm that alternates online and offline optimization via importance sampling. To capture the uncertainty induced by importance sampling, they propose a surrogate objective function derived from a statistical bound on the estimated performance, which helps bound the variance of the surrogate objective in terms of the Rényi divergence.…”
Section: Variance Reduction In Policy Gradient
mentioning
Recent success in Deep Reinforcement Learning (DRL) methods has shown that policy optimization with respect to an off-policy distribution via importance sampling is effective for sample reuse. In this paper, we show that the use of importance sampling can introduce high variance in the objective estimate. Specifically, we show in a principled way that the variance of the importance sampling estimate grows quadratically with the importance ratios, and that large ratios can consequently jeopardize the effectiveness of surrogate objective optimization. We then propose a technique called sample dropout to bound the estimation variance by dropping out samples whose ratio deviation is too high. We instantiate this sample dropout technique on representative policy optimization algorithms, including TRPO, PPO, and ESPO, and demonstrate that it consistently boosts the performance of those DRL algorithms on both continuous and discrete control benchmarks, including MuJoCo, DMControl, and Atari video games. Our code is open-sourced at https://github.com/LinZichuan/sdpo.git.
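The dropout rule lends itself to a compact sketch. The acceptance band below (keep a sample only if its ratio deviates from 1 by at most delta) is an illustrative assumption, not necessarily the paper's exact criterion:

```python
import torch

def sample_dropout_surrogate(ratios, advantages, delta=0.3):
    """Surrogate objective with sample dropout (illustrative sketch).

    ratios     -- importance ratios pi_new(a|s) / pi_old(a|s), shape (N,)
    advantages -- advantage estimates, shape (N,)
    delta      -- half-width of the acceptance band (assumed form)
    """
    # Drop samples whose importance ratio deviates too far from 1;
    # this caps the contribution of large ratios to the estimator variance.
    keep = ((ratios - 1.0).abs() <= delta).float()
    surrogate = (ratios * advantages * keep).sum() / keep.sum().clamp(min=1.0)
    return surrogate
```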
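The statistical bound that the snippet above attributes to [28] and [29] builds on a standard importance-sampling identity (stated here from the general literature rather than quoted from those works): with target $P$, behavioral $Q$, and weights $w(x) = p(x)/q(x)$, the second moment of the weights equals the exponentiated 2-Rényi divergence, which yields, for a bounded integrand $f$,

\[
\operatorname{Var}_{x \sim Q}\!\left[\, w(x)\, f(x) \,\right] \;\le\; \|f\|_\infty^2 \; d_2(P \,\|\, Q),
\qquad
d_2(P \,\|\, Q) \;=\; \exp\!\big( D_2(P \,\|\, Q) \big),
\]

so the variance of the $N$-sample estimator scales as $d_2(P\|Q)/N$: controlling the Rényi divergence (or, in the sample-dropout view, the spread of the ratios) directly controls the variance.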
“…Conf-MDPs were introduced in [7] for finite spaces and extended in [9] to more complex continuous environments. In these seminal works, the agent is fully responsible for the configuration of the environment, which, in turn, becomes an auxiliary task for optimizing its performance.…”
Section: Related Work
mentioning
confidence: 99%
“…Indeed, in Conf-MDPs, the agent is not interested in learning and gathering experience samples in sub-optimal configurations; its interest lies solely in the optimal policy within the optimal environmental configuration. The configuration activity within the environment, as shown in more recent works [8], [11], can also be carried out by an external entity (i.e., a configurator) whose goals can even be adversarial w.r.t. those of the agent [11].…”
Section: Related Work
mentioning
confidence: 99%
“…More specifically, from an agent's perspective, the fleet of vessels can be seen as a feature of the environment that can be optimized to achieve higher performance. In this sense, for single-agent problems, Configurable Markov Decision Processes (Conf-MDPs) [7]-[9] have recently been introduced to extend the Markov Decision Process (MDP) [10] framework to account for environmental configurations. In Conf-MDPs, an agent and a configurator are jointly responsible for finding the optimal policy-configuration pair.…”
Section: Introduction
mentioning
confidence: 99%
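In symbols, where a standard MDP asks only for an optimal policy, a Conf-MDP asks for a policy-configuration pair. A schematic statement of the cooperative objective (notation ours, not taken from [7]-[9]) is

\[
(\pi^*, p^*) \;\in\; \operatorname*{arg\,max}_{\pi \in \Pi,\; p \in \mathcal{P}} \; J(\pi, p)
\;=\; \operatorname*{arg\,max}_{\pi \in \Pi,\; p \in \mathcal{P}} \;
\mathbb{E}_{\pi, p}\!\left[ \sum_{t \ge 0} \gamma^t\, r(s_t, a_t) \right],
\]

where $\Pi$ is the policy space and $\mathcal{P}$ the set of admissible transition models (the configurations).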
“…This is clearly related to our application scenario: our agents are in charge of deciding which container-repositioning policy to play, whereas the configurator is entitled to select the fleet of vessels. While the early works [7]-[9] focused on the case in which the agent and the configurator share the same objective, in [11] the setting has been extended to the case in which the configurator and the agent have different (and possibly adversarial) goals. Although these approaches have strong theoretical guarantees, how to successfully scale them to more complex domains remains an open question.…”
With the continuous growth of the global economy and markets, resource imbalance has risen to be one of the central issues in real logistics scenarios. In marine transportation, this trade imbalance leads to Empty Container Repositioning (ECR) problems. Once freight has been delivered from an exporting country to an importing one, the laden containers turn into empty containers that need to be repositioned to satisfy new goods requests in exporting countries. In such problems, the performance that any cooperative repositioning policy can achieve strictly depends on the routes that the vessels follow (i.e., the fleet deployment). Historically, Operations Research (OR) approaches were proposed to jointly optimize the repositioning policy along with the fleet of vessels. However, the stochasticity of future container supply and demand, together with the black-box and non-linear constraints present in the environment, makes these approaches unsuitable for such scenarios. In this paper, we introduce a novel framework, Configurable Semi-POMDPs, to model this class of problems. Furthermore, we provide a two-stage learning algorithm, "Configure & Conquer" (CC), that first configures the environment by finding an approximation of the optimal fleet deployment strategy, and then "conquers" it by learning an ECR policy in this tuned environmental setting. We validate our approach on large, real-world instances of the problem. Our experiments highlight that CC avoids the pitfalls of OR methods and that it successfully optimizes both the ECR policy and the fleet of vessels, leading to superior performance in world trade environments.
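The two-stage structure of CC can be sketched as follows; every interface here (the candidate deployments, the evaluator, the policy trainer) is a hypothetical placeholder, not the authors' implementation:

```python
def configure_and_conquer(candidate_deployments, make_env,
                          evaluate_deployment, train_ecr_policy):
    """Illustrative sketch of the two-stage 'Configure & Conquer' scheme."""
    # Stage 1 (Configure): search for an approximately optimal fleet
    # deployment, here by scoring each candidate configuration.
    best_deployment = max(candidate_deployments, key=evaluate_deployment)

    # Stage 2 (Conquer): fix the chosen configuration and learn an
    # Empty Container Repositioning policy in the tuned environment.
    env = make_env(best_deployment)
    ecr_policy = train_ecr_policy(env)
    return best_deployment, ecr_policy
```

The key design point this mirrors is that the policy is never trained in sub-optimal configurations: the environment is tuned first, and learning effort is spent only in the chosen setting.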