Abstract: We study the problem of identifying the policy space available to an agent in a learning process, having access to a set of demonstrations generated by the agent playing the optimal policy in the considered space. We introduce an approach based on frequentist statistical testing to identify the set of policy parameters that the agent can control, within a larger parametric policy space. After presenting two identification rules (combinatorial and simplified), applicable under different assumptions on the policy space, …
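The truncated abstract does not show the test itself. As a minimal sketch, assuming the identification rules reduce to per-parameter likelihood-ratio tests on the demonstrations (the function below and its arguments are hypothetical illustrations, not the paper's implementation):

```python
from scipy import stats

def parameter_is_controlled(loglik_full, loglik_restricted, alpha=0.05):
    """Hypothetical per-parameter identification test (illustrative only).

    loglik_full       -- maximized log-likelihood of the demonstrations
                         under the full parametric policy space
    loglik_restricted -- maximized log-likelihood with the tested parameter
                         clamped to its default (uncontrolled) value
    alpha             -- significance level of the frequentist test
    """
    # Wilks' theorem: under the null hypothesis (parameter not controlled),
    # the statistic 2*(ll_full - ll_restricted) is asymptotically chi^2(1).
    lr_stat = 2.0 * (loglik_full - loglik_restricted)
    p_value = stats.chi2.sf(lr_stat, df=1)
    return p_value < alpha  # rejecting the null suggests the parameter is controlled
```

Under such a reading, the simplified rule would test each parameter in isolation, while the combinatorial rule would compare subsets of parameters jointly.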
“…We also demonstrate in our experiments that our sample dropout technique can boost the sample efficiency of GAE-based policy optimization algorithms. [28] and [29] propose an actor-only policy optimization algorithm that alternates online and offline optimization via importance sampling. To capture the uncertainty induced by importance sampling, they propose a surrogate objective function derived from a statistical bound on the estimated performance, which helps bound the variance of the surrogate objective in terms of the Rényi divergence.…”
Section: Variance Reduction In Policy Gradient
mentioning
Recent success in Deep Reinforcement Learning (DRL) methods has shown that policy optimization with respect to an off-policy distribution via importance sampling is effective for sample reuse. In this paper, we show that the use of importance sampling can introduce high variance in the objective estimate. Specifically, we show in a principled way that the variance of the importance sampling estimate grows quadratically with the importance ratios, and that large ratios can consequently jeopardize the effectiveness of surrogate objective optimization. We then propose a technique called sample dropout to bound the estimation variance by dropping out samples whose ratio deviation is too high. We instantiate this sample dropout technique on representative policy optimization algorithms, including TRPO, PPO, and ESPO, and demonstrate that it consistently boosts the performance of those DRL algorithms on both continuous and discrete control benchmarks, including MuJoCo, DMControl, and Atari video games. Our code is open-sourced at https://github.com/LinZichuan/sdpo.git.
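The dropout rule lends itself to a compact sketch. The acceptance band below (keep a sample only if its ratio deviates from 1 by at most delta) is an illustrative assumption, not necessarily the paper's exact criterion:

```python
import torch

def sample_dropout_surrogate(ratios, advantages, delta=0.3):
    """Surrogate objective with sample dropout (illustrative sketch).

    ratios     -- importance ratios pi_new(a|s) / pi_old(a|s), shape (N,)
    advantages -- advantage estimates, shape (N,)
    delta      -- half-width of the acceptance band (assumed form)
    """
    # Drop samples whose importance ratio deviates too far from 1;
    # this caps the contribution of large ratios to the estimator variance.
    keep = ((ratios - 1.0).abs() <= delta).float()
    surrogate = (ratios * advantages * keep).sum() / keep.sum().clamp(min=1.0)
    return surrogate
```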
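The statistical bound that the snippet above attributes to [28] and [29] builds on a standard importance-sampling identity (stated here from the general literature rather than quoted from those works): with target $P$, behavioral $Q$, and weights $w(x) = p(x)/q(x)$, the second moment of the weights equals the exponentiated 2-Rényi divergence, which yields, for a bounded integrand $f$,

\[
\operatorname{Var}_{x \sim Q}\!\left[\, w(x)\, f(x) \,\right] \;\le\; \|f\|_\infty^2 \; d_2(P \,\|\, Q),
\qquad
d_2(P \,\|\, Q) \;=\; \exp\!\big( D_2(P \,\|\, Q) \big),
\]

so the variance of the $N$-sample estimator scales as $d_2(P\|Q)/N$: controlling the Rényi divergence (or, in the sample-dropout view, the spread of the ratios) directly controls the variance.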
“…Conf-MDPs were introduced in [7] for finite spaces and extended in [9] to more complex continuous environments. In these seminal works, the agent is fully responsible for the configuration of the environment, which, in turn, becomes an auxiliary task for optimizing its performance.…”
Section: Related Work
mentioning
confidence: 99%
“…Indeed, in Conf-MDPs, the agent is not interested in learning and gathering experience samples in sub-optimal configurations; its interest lies solely in the optimal policy within the optimal environmental configuration. The configuration activity within the environment, as shown in more recent works [8], [11], can also be carried out by an external entity (i.e., a configurator) whose goals can even be adversarial w.r.t. those of the agent [11].…”
Section: Related Work
mentioning
confidence: 99%
“…More specifically, from an agent's perspective, the fleet of vessels can be seen as a feature of the environment that can be optimized to achieve higher performance. In this sense, for single-agent problems, Configurable Markov Decision Processes (Conf-MDPs) [7]-[9] have recently been introduced to extend the Markov Decision Process (MDP) [10] framework to account for environmental configurations. In Conf-MDPs, an agent and a configurator are jointly responsible for finding the optimal policy-configuration pair.…”
Section: Introduction
mentioning
confidence: 99%
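In symbols, where a standard MDP asks only for an optimal policy, a Conf-MDP asks for a policy-configuration pair. A schematic statement of the cooperative objective (notation ours, not taken from [7]-[9]) is

\[
(\pi^*, p^*) \;\in\; \operatorname*{arg\,max}_{\pi \in \Pi,\; p \in \mathcal{P}} \; J(\pi, p)
\;=\; \operatorname*{arg\,max}_{\pi \in \Pi,\; p \in \mathcal{P}} \;
\mathbb{E}_{\pi, p}\!\left[ \sum_{t \ge 0} \gamma^t\, r(s_t, a_t) \right],
\]

where $\Pi$ is the policy space and $\mathcal{P}$ the set of admissible transition models (the configurations).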
“…This is clearly related to our application scenario: our agents are in charge of deciding which container-repositioning policy to play, whereas the configurator is entitled to select the fleet of vessels. While the early works [7]-[9] focused on the case in which the agent and the configurator share the same objective, in [11] the setting has been extended to the case in which the configurator and the agent have different (and possibly adversarial) goals. Although these approaches have strong theoretical guarantees, how to successfully scale them to more complex domains remains an open question.…”
With the continuous growth of the global economy and markets, resource imbalance has risen to be one of the central issues in real logistics scenarios. In marine transportation, this trade imbalance leads to Empty Container Repositioning (ECR) problems. Once freight has been delivered from an exporting country to an importing one, the laden containers turn into empty containers that need to be repositioned to satisfy new goods requests in exporting countries. In such problems, the performance that any cooperative repositioning policy can achieve strictly depends on the routes that the vessels follow (i.e., the fleet deployment). Historically, Operations Research (OR) approaches were proposed to jointly optimize the repositioning policy along with the fleet of vessels. However, the stochasticity of future container supply and demand, together with the black-box and non-linear constraints present in the environment, makes these approaches unsuitable for such scenarios. In this paper, we introduce a novel framework, Configurable Semi-POMDPs, to model this class of problems. Furthermore, we provide a two-stage learning algorithm, "Configure & Conquer" (CC), that first configures the environment by finding an approximation of the optimal fleet deployment strategy, and then "conquers" it by learning an ECR policy in this tuned environmental setting. We validate our approach on large, real-world instances of the problem. Our experiments highlight that CC avoids the pitfalls of OR methods and that it successfully optimizes both the ECR policy and the fleet of vessels, leading to superior performance in world trade environments.
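The two-stage structure of CC can be sketched as follows; every interface here (the candidate deployments, the evaluator, the policy trainer) is a hypothetical placeholder, not the authors' implementation:

```python
def configure_and_conquer(candidate_deployments, make_env,
                          evaluate_deployment, train_ecr_policy):
    """Illustrative sketch of the two-stage 'Configure & Conquer' scheme."""
    # Stage 1 (Configure): search for an approximately optimal fleet
    # deployment, here by scoring each candidate configuration.
    best_deployment = max(candidate_deployments, key=evaluate_deployment)

    # Stage 2 (Conquer): fix the chosen configuration and learn an
    # Empty Container Repositioning policy in the tuned environment.
    env = make_env(best_deployment)
    ecr_policy = train_ecr_policy(env)
    return best_deployment, ecr_policy
```

The key design point this mirrors is that the policy is never trained in sub-optimal configurations: the environment is tuned first, and learning effort is spent only in the chosen setting.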