2021
DOI: 10.1115/1.0001814v
Preprint

Reliability-Based Reinforcement Learning Under Uncertainty

Cited by 2 publications (5 citation statements)
References 0 publications
“…The above definition of robust Q-values reduces to the known Q-values defined for the sa-rectangular R-contamination uncertainty set in [25] and the sa-rectangular L_p constrained uncertainty set in [12, 4].…”
Section: Discussion
confidence: 99%
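For context, the robust Q-value referenced in this statement is typically defined as a worst case over the uncertainty set. A minimal sketch of the standard form, assuming an sa-rectangular set \mathcal{P}_{sa} per state-action pair (notation ours, not necessarily the citing paper's):

```latex
% Robust Q-value: the adversary picks the worst-case transition
% kernel independently for each (s, a) pair.
Q^{\pi}(s,a) = \min_{P \in \mathcal{P}_{sa}} \Big[ r(s,a)
             + \gamma \sum_{s'} P(s' \mid s,a)\, V^{\pi}(s') \Big]
```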
“…Without some structural assumptions on the uncertainty set, solving robust MDPs can be NP-hard [28]. Therefore, to preserve tractability, we often assume that the uncertainty set is convex and s-rectangular, that is, it can be expressed as a Cartesian product over states [18, 7, 28, 4, 12, 25]. In that case, standard solvers for MDPs carry over to robust MDPs.…”
Section: Introduction
confidence: 99%
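To illustrate why rectangularity preserves tractability, here is a minimal NumPy sketch of robust value iteration under an sa-rectangular R-contamination set, the set type named in the statement above, where the worst case has a closed form. The function name and toy MDP are illustrative, not from the cited papers:

```python
import numpy as np

def robust_backup_r_contamination(R, P, V, gamma=0.99, delta=0.1):
    """One robust Bellman backup over an sa-rectangular R-contamination set.

    For each (s, a) the adversary mixes the nominal kernel P[s, a] with an
    arbitrary distribution at weight delta; the worst case reduces to
    (1 - delta) * <P[s, a], V> + delta * min_{s'} V(s').
    R: (S, A) rewards, P: (S, A, S) nominal kernel, V: (S,) value estimate.
    """
    worst = (1.0 - delta) * (P @ V) + delta * V.min()  # (S, A) worst-case next value
    Q = R + gamma * worst
    return Q.max(axis=1)                               # greedy robust value update

# toy usage: S=3 states, A=2 actions, random nominal MDP
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))
R = rng.random((3, 2))
V = np.zeros(3)
for _ in range(200):
    V = robust_backup_r_contamination(R, P, V)
print(V)
```

Because the uncertainty set factors over (s, a) pairs, the inner minimization decomposes per backup, so this loop has the same cost as standard value iteration.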
“…Ghugare et al. (2022) optimize an objective for learning a latent-space model and policy jointly, aiming to maximize a lower bound on the overall RL objective. An et al. (2021) propose uncertainty-based methods that guide the Q-value function update using the data with high confidence. Others address this problem by imitating experts (Zolna et al., 2020) or learning ensembles (Agarwal et al., 2020).…”
Section: Related Work
confidence: 99%
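As a concrete illustration of the uncertainty-based Q-update mentioned here, below is a minimal PyTorch sketch of an ensemble-minimum TD target in the spirit of An et al. (2021). The class and function names are our own illustration, not the authors' code:

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Tiny Q(s, a) network; illustrative architecture only."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, obs, act):
        return self.f(torch.cat([obs, act], dim=-1)).squeeze(-1)

def ensemble_min_target(q_nets, next_obs, next_act, reward, done, gamma=0.99):
    # Pessimistic TD target: the ensemble minimum down-weights state-action
    # pairs on which the Q-estimates disagree, i.e., low-confidence data.
    with torch.no_grad():
        q_next = torch.stack([q(next_obs, next_act) for q in q_nets])  # (N, B)
        return reward + gamma * (1.0 - done) * q_next.min(dim=0).values
```

The design choice is that ensemble disagreement acts as an uncertainty proxy: the minimum over N estimates penalizes actions poorly supported by the dataset without an explicit behavior model.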
“…Offline RL, also known as batch RL, plays an appealing alternative role (An et al., 2021; Fujimoto and Gu, 2021; Fujimoto et al., 2019). In direct contrast to online RL, offline RL acquires effective policies from previously collected large-scale data, without online interaction during training.…”
Section: Introduction
confidence: 99%
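To make the online/offline distinction concrete, a minimal tabular sketch of learning purely from a fixed batch of transitions, with no environment calls during training (the toy data and names are illustrative, not from the cited works):

```python
import numpy as np

def fitted_q_iteration(dataset, n_states, n_actions, gamma=0.99, iters=100):
    """Tabular offline RL sketch: learn Q from a static batch of
    (s, a, r, s') tuples only. Real offline RL methods additionally add
    pessimism or policy constraints to handle distribution shift."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        for s, a, r, s2 in dataset:            # replay the fixed batch
            Q[s, a] = r + gamma * Q[s2].max()  # fitted Q backup
    return Q

# usage: a hand-coded 2-state dataset collected by some prior behavior policy
data = [(0, 0, 1.0, 1), (1, 1, 0.0, 0), (0, 1, 0.5, 0)]
Q = fitted_q_iteration(data, n_states=2, n_actions=2)
print(Q)
```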