2020
DOI: 10.48550/arxiv.2010.06324
Preprint

Balancing Constraints and Rewards with Meta-Gradient D4PG

Dan A. Calian,
Daniel J. Mankowitz,
Tom Zahavy
et al.

Abstract: Deploying Reinforcement Learning (RL) agents to solve real-world applications often requires satisfying complex system constraints. Often the constraint thresholds are incorrectly set due to the complex nature of a system or the inability to verify the thresholds offline (e.g., no simulator or reasonable offline evaluation procedure exists). This results in solutions where a task cannot be solved without violating the constraints. However, in many real-world cases, constraint violations are undesirable yet they…


Cited by 3 publications (4 citation statements)
References 7 publications
“…In previous work, Calian et al (2020) tune the learning rate of the Lagrange multipliers to automatically turn some constraints into soft-constraints when the agent is not able to satisfy them after a given period of time. The bootstrap constraint instead allows us to start making some progress on the main task without turning our hard constraints into soft constraints.…”
Section: Bootstrap Constraint
confidence: 99%
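
The adaptive-multiplier idea described in the quote above can be made concrete with a small sketch. The snippet below is purely illustrative (the function name update_multiplier, the decaying learning-rate schedule, and all variable names are assumptions, not code from Calian et al.): the Lagrange multiplier is updated by gradient ascent on the constraint violation, and shrinking its learning rate, here a simple stand-in for the meta-gradient tuning in the cited work, effectively softens a constraint the agent cannot satisfy.

# Illustrative sketch only: one gradient-ascent step on a single Lagrange
# multiplier, with a learning rate treated as tunable. In the cited work this
# rate is adapted automatically (via meta-gradients); a fixed decay stands in
# for that mechanism here.
def update_multiplier(lmbda, avg_cost, threshold, lr):
    violation = avg_cost - threshold   # > 0 means the constraint is violated
    lmbda = lmbda + lr * violation     # ascend on the dual variable
    return max(lmbda, 0.0)             # multipliers stay non-negative

# Toy usage: with a persistently violated constraint, decaying the learning
# rate keeps the multiplier (and hence the penalty) bounded, so the hard
# constraint behaves like a soft one.
lmbda, lr = 0.0, 0.1
for step in range(100):
    avg_cost = 1.5                     # pretend the agent never satisfies it
    lmbda = update_multiplier(lmbda, avg_cost, threshold=1.0, lr=lr)
    lr *= 0.95                         # hypothetical stand-in for a meta-learned schedule
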
“…Most of these works tackle CMDPs from the perspective of Safe RL, which seeks to minimize the total regret over the cost functions throughout training (Ray, Achiam, and Amodei 2019) and focus on the single-constraint case (Zhang, Vuong, and Ross 2020; Dalal et al. 2018; Calian et al. 2020) or aggregate various types of events under a single constraint (Stooke, Achiam, and Abbeel 2020; Ray, Achiam, and Amodei 2019). In this work, we focus our attention on the potential of CMDPs for precise and intuitive behavior specification and work on the problem of satisfying many constraints simultaneously.…”
Section: Related Work: Constrained Reinforcement Learning
confidence: 99%
“…Learning RL policies under safety constraints [12,13,7] has become an important topic in the community due to safety concerns in real-world applications. Many methods based on constrained optimization have been developed, such as trust region methods [5], Lagrangian methods [5,6,14], barrier methods [15,16], Lyapunov methods [4,17], etc. Another direction is based on the safety critic, where an additional value estimator is learned to predict cost, apart from the primal critic estimating the discounted return [7,18].…”
Section: Related Work
confidence: 99%
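
As a rough illustration of the safety-critic idea mentioned in the quote above, the sketch below pairs the usual return critic with a second critic that estimates discounted cost. It is an assumed, minimal PyTorch architecture (the class name TwinCritic and all layer sizes are illustrative), not an implementation from any of the cited papers.

import torch
import torch.nn as nn

class TwinCritic(nn.Module):
    """Reward critic and safety (cost) critic over the same (s, a) input."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        def mlp():
            return nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
        self.reward_q = mlp()  # Q_r(s, a): estimate of the discounted return
        self.cost_q = mlp()    # Q_c(s, a): estimate of the discounted cost

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return self.reward_q(x), self.cost_q(x)

# A Lagrangian-style policy loss can then trade the two estimates off, e.g.
#   policy_loss = -(q_r - lmbda * q_c).mean()
# where lmbda is a (fixed or learned) Lagrange multiplier.
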
“…A wide variety of constrained reinforcement learning frameworks have been proposed to solve constrained MDPs (CMDPs) [43]. They either convert a CMDP into an unconstrained min-max problem by introducing Lagrangian multipliers [12,14,44-48], or seek to obtain the optimal policy by directly solving constrained optimization problems [11,13,18-20,49-51]. However, it is hard to scale these single-agent methods to our multi-agent setting due to computational inefficiency.…”
Section: Related Work
confidence: 99%
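
For reference, the Lagrange-multiplier conversion mentioned in the quote above is the textbook relaxation of a CMDP; the notation below (J_r for the return objective, J_{c_i} and d_i for each constraint's cost and threshold, lambda_i for the multipliers) is generic and not taken from any particular cited paper.

\[
  \max_{\pi}\; J_r(\pi)
  \quad \text{s.t.} \quad
  J_{c_i}(\pi) \le d_i,\quad i = 1,\dots,k
\]
is relaxed into the unconstrained min-max problem
\[
  \min_{\lambda \ge 0}\;\max_{\pi}\;
  J_r(\pi) - \sum_{i=1}^{k} \lambda_i \bigl( J_{c_i}(\pi) - d_i \bigr),
\]
where ascending on \(\pi\) and descending on \(\lambda\) recovers the primal-dual methods referenced above.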