A Robust Offline Reinforcement Learning Algorithm Based on Behavior Regularization Methods

Zhang, Yan; Mi, Qingwei

doi:10.1109/iaict55358.2022.9887435

Cited by 3 publications

(3 citation statements)

References 5 publications


“…There is a growing number of results under partial coverage following the principle of pessimism in offline RL (Yu et al., 2020; Kidambi et al., 2020). In comparison to works that focus on tabular (Rashidinejad et al., 2021; Shi et al., 2022; Yin and Wang, 2021) or linear models (Jin et al., 2020; Chang et al., 2021; Zhang et al., 2022; Nguyen-Tang et al., 2022; Bai et al., 2022), our emphasis is on general function approximation (Jiang and Huang, 2020; Uehara and Sun, 2022; Xie et al., 2021; Zhan et al., 2022; Rashidinejad et al., 2022; Zanette and Wainwright, 2022). Among them, we specifically focus on model-free methods.…”

Section: Related Work (mentioning)

confidence: 99%

Refined Value-Based Offline RL under Realizability and Partial Coverage

Uehara, Kallus, Lee et al. 2023

Preprint


In offline reinforcement learning (RL) we have no opportunity to explore so we must make assumptions that the data is sufficient to guide picking a good policy, taking the form of assuming some coverage, realizability, Bellman completeness, and/or hard margin (gap). In this work we propose value-based algorithms for offline RL with PAC guarantees under just partial coverage, specifically, coverage of just a single comparator policy, and realizability of soft (entropy-regularized) Q-function of the single policy and a related function defined as a saddle point of certain minimax optimization problem. This offers refined and generally more lax conditions for offline RL. We further show an analogous result for vanilla Q-functions under a soft margin condition. To attain these guarantees, we leverage novel minimax learning algorithms to accurately estimate soft or vanilla Q-functions with L²-convergence guarantees. Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.



“…We now proceed to bound (29). It is worth noting that both f_t and (s_t, a_t) depend on s_0, a_0, s_1, .…”

Section: Lemma (mentioning)

confidence: 99%

The Efficacy of Pessimism in Asynchronous Q-Learning

Yao, Chen et al. 2023

IEEE Trans. Inform. Theory


This paper is concerned with the asynchronous form of Q-learning, which applies a stochastic approximation scheme to Markovian data samples. Motivated by the recent advances in offline reinforcement learning, we develop an algorithmic framework that incorporates the principle of pessimism into asynchronous Q-learning, which penalizes infrequently-visited state-action pairs based on suitable lower confidence bounds (LCBs). This framework leads to, among other things, improved sample efficiency and enhanced adaptivity in the presence of near-expert data. Our approach permits the observed data in some important scenarios to cover only partial state-action space, which is in stark contrast to prior theory that requires uniform coverage of all state-action pairs. When coupled with the idea of variance reduction, asynchronous Q-learning with LCB penalization achieves near-optimal sample complexity, provided that the target accuracy level is small enough. In comparison, prior works were suboptimal in terms of the dependency on the effective horizon even when i.i.d. sampling is permitted. Our results deliver the first theoretical support for the use of pessimism principle in the presence of Markovian non-i.i.d. data.
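
The pessimism idea in this abstract can be illustrated with a minimal tabular sketch. It is not the paper's algorithm: the step size `1/n`, the penalty shape `c_b / sqrt(n)`, the clipping at zero, and the names `lcb_q_learning` and `c_b` are all assumptions chosen for illustration, and rewards are assumed to lie in [0, 1].

```python
import math

def lcb_q_learning(trajectory, n_states, n_actions, gamma=0.9, c_b=1.0):
    """Q-learning along one Markovian trajectory with an LCB-style penalty.

    trajectory: iterable of (s, a, r, s_next) tuples in time order.
    Hypothetical sketch: step size and penalty are illustrative choices.
    """
    q = [[0.0] * n_actions for _ in range(n_states)]   # pessimistic init at 0
    visits = [[0] * n_actions for _ in range(n_states)]
    for s, a, r, s_next in trajectory:
        visits[s][a] += 1
        n = visits[s][a]
        eta = 1.0 / n                    # assumed step size
        penalty = c_b / math.sqrt(n)     # assumed penalty, shrinks with visits
        target = r + gamma * max(q[s_next]) - penalty
        updated = (1.0 - eta) * q[s][a] + eta * target
        q[s][a] = max(updated, 0.0)      # never drop below the trivial lower bound
    return q

# One state, two actions; only action 0 ever appears in the data.
q = lcb_q_learning([(0, 0, 1.0, 0)] * 200, n_states=1, n_actions=2)
```

Only the visited pair is updated at each step, which is the asynchronous aspect. In this toy run the unvisited action keeps its initial value of zero while the frequently visited action earns a positive estimate, so a greedy policy stays within the part of the state-action space the data covers.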


“…Defenses against Adversarial Attacks on RL. 2.3.2.4 Defenses against Data Corruptions. Zhang et al. [81] investigate offline-RL's robustness when data corruption occurs. The authors examine the situation where an adversary can modify the ϵ fraction of a batch dataset composed of tuples (s, a, s′, r), with the objective of enabling an agent to learn a policy that is near optimal. Through theoretical analysis, they propose robust variants of the Least Square Value Iteration algorithms as well as provide general robustness bounds for RL.…”

mentioning

confidence: 99%

Environment poisoning in reinforcement learning: attacks and resilience

Xu


Upon finishing my Ph.D. thesis, gratitude is the strongest feeling inside of me. I would like to express my sincere appreciation to all those who have provided me with invaluable help in this process. I wish to express my greatest gratitude to my supervisor, Prof. Zinovi Rabinovich, not only for his expert guidance and scholarly advice on my research work, but also for his strong support and encouragement throughout my Ph.D. journey. His insightful comments and critical thinking always inspire me in tackling research problems. His attitude towards the academic research provides me with a good model of what a true researcher should be. Without his patient instruction and constant encouragement, I could not fulfil these research works and complete the Ph.D. study. I would like to thank all talented members in Prof. Rabinovich's team, including Rundong Wang, Ridhima Bector and Wei Qiu, for the kind help and support that made my study in NTU a wonderful time. Besides, I would like to thank my collaborators, Xinghua Qu and Lev Raizman, for their valuable feedback and precious suggestions on my research studies. Many thanks to my friends, Yanling Li and Shuyang Ding, for the joy and the encouragement they gave me. I would like to thank my Thesis Advisory Committee (TAC) members, Prof. Sinno Jialin Pan and Dr. Fedor Duzhin, for their insightful comments and suggestions on my research work. I also want to thank Mr. Kesavan Asaithambi for his support in Computational Intelligence Lab. Most importantly, I would like to thank my parents, Zhaoxin Xu and Jiaojun Han, for their unconditional love and support throughout my life. I would like to express special thanks to my husband and best friend, Hantao Huang, for always being there for me and supporting me at all times. Having you in my life makes me feel incredibly lucky every single day. This thesis is dedicated to all of you.

