2014
DOI: 10.48550/arxiv.1405.6757
Preprint

Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces

Abstract: Reinforcement learning is a simple yet comprehensive theory of learning that simultaneously models the adaptive behavior of artificial agents, such as robots and autonomous software programs, and attempts to explain the emergent behavior of biological systems. It also gives rise to computational ideas that provide a powerful tool for solving problems involving sequential prediction and decision making. Temporal difference learning is the most widely used method to solve reinforcement learning problem…

Cited by 23 publications (31 citation statements)
References 93 publications (146 reference statements)
“…The line of research reported here has much in common with work on proximal reinforcement learning [Mahadevan et al, 2014], which explores first-order reinforcement learning algorithms using mirror maps [Bubeck, 2014; Juditsky et al, 2008] to construct primal-dual spaces. This work began originally with a dual space formulation of first-order sparse TD learning.…”
Section: Antos
confidence: 99%
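To make the mirror-map construction concrete, here is a minimal sketch (not the authors' algorithm) of one mirror-descent TD(0) step with a p-norm mirror map ψ(w) = ½‖w‖²_q, a standard choice in the sparse first-order TD literature. The function names, the choice of p > 1, and the use of the plain TD(0) semi-gradient are illustrative assumptions.

```python
import numpy as np

def grad_psi(w, q):
    """Gradient of the mirror map psi(w) = 0.5 * ||w||_q^2."""
    norm = np.linalg.norm(w, ord=q)
    if norm == 0.0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (q - 1) / norm ** (q - 2)

def grad_psi_star(y, p):
    """Gradient of the conjugate psi*(y) = 0.5 * ||y||_p^2 (dual -> primal map)."""
    norm = np.linalg.norm(y, ord=p)
    if norm == 0.0:
        return np.zeros_like(y)
    return np.sign(y) * np.abs(y) ** (p - 1) / norm ** (p - 2)

def mirror_td0_step(theta, phi_s, phi_s_next, reward, gamma, alpha, p):
    """One mirror-descent TD(0) step: gradient step in the dual space,
    then map back to the primal space through the conjugate mirror map."""
    q = p / (p - 1.0)                                            # 1/p + 1/q = 1
    delta = reward + gamma * theta @ phi_s_next - theta @ phi_s  # TD error
    td_grad = -delta * phi_s                                     # semi-gradient of 0.5 * delta^2
    y = grad_psi(theta, q) - alpha * td_grad                     # dual-space update
    return grad_psi_star(y, p)                                   # back to primal weights
```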
“…A sparse off-policy GTD2 algorithm with regularized dual averaging is introduced by Qin and Li [2014]. These studies provide different approaches to formulating the problem: as a variational inequality problem [Juditsky et al, 2008; Mahadevan et al, 2014], as a linear inverse problem, or as a quadratic objective function (MSPBE) solved with two-time-scale solvers [Qin and Li, 2014]. In this paper, we explore the true nature of the GTD algorithms as stochastic gradient algorithms w.r.t. the convex-concave saddle-point formulations of NEU and MSPBE.…”
Section: Antos
confidence: 99%
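For reference, the GTD2 update can be read directly as a two-time-scale primal-dual stochastic gradient method on the saddle-point form of the MSPBE. The sketch below is illustrative rather than the exact pseudocode of any of the cited papers; variable names are assumptions, and the dual step size beta is typically taken larger than the primal step size alpha.

```python
import numpy as np

def gtd2_step(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One GTD2 update viewed as stochastic gradient descent/ascent on the
    convex-concave saddle-point formulation of the MSPBE: w (dual) tracks
    E[delta * phi], theta (primal) moves along the corrected TD direction."""
    delta = reward + gamma * theta @ phi_next - theta @ phi           # TD error
    w_new = w + beta * (delta - phi @ w) * phi                        # dual ascent step
    theta_new = theta + alpha * (phi - gamma * phi_next) * (phi @ w)  # primal descent step
    return theta_new, w_new
```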
“…Introducing regularization in the GTD objective is not new. Mahadevan et al (2014) introduce the proximal GTD learning framework to integrate GTD algorithms with first-order optimization-based regularization via saddle-point formulations and proximal operators. Yu (2017) introduces a general regularization term for improving robustness.…”
Section: Gradient Emphasis Learning
confidence: 99%
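In this framework the regularizer enters through its proximal operator. As a minimal, hypothetical illustration (not Mahadevan et al.'s or Yu's specific regularized objective), an l1 penalty gives the closed-form soft-thresholding operator, applied after the gradient step on the smooth part of the loss:

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||.||_1: the closed-form minimizer of
    0.5 * ||z - x||^2 + tau * ||z||_1 over z (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def proximal_gradient_step(theta, grad, alpha, lam):
    """Generic proximal-gradient step: gradient step on the smooth loss,
    then the proximal map of the l1 regularizer."""
    return soft_threshold(theta - alpha * grad, alpha * lam)
```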
“…Here, the double sampling issue refers to the need for two independent samples of the next state from the current state in order to obtain an unbiased stochastic estimate of the gradient of the objective, mainly due to its quadratic nonlinearity. Alternatively, [28], [39] get around this difficulty by resorting to min-max reformulations of the MSBE and MSPBE and introduce primal-dual type methods for policy evaluation with finite-sample analysis. Similar ideas have also been employed for policy optimization based on the (softmax) Bellman optimality equation; see, e.g., [34] (the Smoothed Bellman Error Embedding (SBEED) algorithm).…”
Section: B. Modern Optimization-Based RL Algorithms
confidence: 99%
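To see where double sampling comes from, consider the MSBE, E_s[(E[δ | s])²]: its gradient contains a product of two conditional expectations, so an unbiased estimate needs two next states drawn independently from the same state. The sketch below is illustrative only and assumes linear value features; the min-max reformulations in [28], [39] avoid the second sample by introducing a dual variable.

```python
import numpy as np

def msbe_grad_estimate(theta, phi, reward_1, phi_next_1, phi_next_2, gamma):
    """Unbiased estimate of (half) the MSBE gradient at a state s, assuming
    linear values V(s) = theta @ phi(s). The TD error uses the first next-state
    sample; the gradient factor uses the second, independent sample, so the two
    conditional expectations factor correctly (the double sampling requirement)."""
    delta = reward_1 + gamma * theta @ phi_next_1 - theta @ phi   # sample 1: TD error
    return delta * (gamma * phi_next_2 - phi)                     # sample 2: d(delta)/d(theta)
```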