Demystifying the Curse of Horizon in Offline Reinforcement Learning in Order to Break It

Offline reinforcement learning (RL), where we evaluate and learn new policies using existing off-policy data, is crucial in applications where experimentation is challenging and simulation unreliable, such as medicine. It is also notoriously difficult because the similarity (density ratio) between observed trajectories and those generated by any new policy diminishes exponentially as the horizon grows, a phenomenon known as the curse of horizon, which severely limits the application of offline RL whenever horizons are moderately long or even infinite. In “Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning,” Kallus and Uehara set out to understand these limits and when they can be broken. They precisely characterize the curse by deriving semiparametric efficiency lower bounds for the policy-value estimation problem in different models. On the one hand, this shows why the curse necessarily plagues standard estimators: they work even in non-Markov models and therefore must be limited by the corresponding bound. On the other hand, greater efficiency is possible in certain Markovian models, and the authors give the first estimator achieving these much lower efficiency bounds in infinite-horizon Markov decision processes.
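To make the contrast concrete, here is a minimal, hypothetical sketch (not the paper's doubly robust estimator) comparing standard trajectory-wise importance sampling, whose weight is a product of per-step density ratios and therefore degrades exponentially with the horizon, against a marginalized estimator that reweights individual transitions by the stationary state-action density ratio, which stays bounded in the Markov setting. The toy MDP, the policies, and the oracle ratio w are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP (purely illustrative; not from the paper).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])    # R[s, a]
behavior = np.array([[0.5, 0.5], [0.5, 0.5]])  # logging policy b(a | s)
target = np.array([[0.9, 0.1], [0.2, 0.8]])    # evaluation policy pi(a | s)

def sample_trajectories(n, horizon):
    trajs = []
    for _ in range(n):
        s, traj = 0, []
        for _ in range(horizon):
            a = rng.choice(2, p=behavior[s])
            traj.append((s, a, R[s, a]))
            s = rng.choice(2, p=P[s, a])
        trajs.append(traj)
    return trajs

def trajectory_is(trajs):
    # Standard importance sampling: one cumulative product of per-step ratios
    # per trajectory, so the weight's variance can grow exponentially in horizon.
    vals = []
    for traj in trajs:
        w, total = 1.0, 0.0
        for s, a, r in traj:
            w *= target[s, a] / behavior[s, a]
            total += r
        vals.append(w * total / len(traj))
    return np.mean(vals)

def marginalized_is(trajs, w_sa):
    # Markov-aware reweighting: each transition is weighted by the stationary
    # state-action density ratio w(s, a), which stays bounded as the horizon grows.
    num = den = 0.0
    for traj in trajs:
        for s, a, r in traj:
            num += w_sa[s, a] * r
            den += w_sa[s, a]
    return num / den

def stationary_sa(policy):
    # Stationary distribution over (s, a) of the chain induced by a policy.
    Ps = np.einsum("sa,sap->sp", policy, P)
    evals, evecs = np.linalg.eig(Ps.T)
    d_s = np.real(evecs[:, np.argmax(np.real(evals))])
    d_s = d_s / d_s.sum()
    return d_s[:, None] * policy

w_sa = stationary_sa(target) / stationary_sa(behavior)
trajs = sample_trajectories(n=200, horizon=100)
print("trajectory IS:  ", trajectory_is(trajs))
print("marginalized IS:", marginalized_is(trajs, w_sa))
```

Even in this two-state example, the trajectory-wise weights span many orders of magnitude at horizon 100, while the marginalized weights remain fixed per state-action pair.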
This work studies the question of representation learning in RL: how can we learn a compact, low-dimensional representation such that, on top of it, we can perform RL procedures such as exploration and exploitation in a sample-efficient manner? We focus on low-rank Markov Decision Processes (MDPs), where the transition dynamics correspond to a low-rank transition matrix. Unlike prior works that assume the representation is known (e.g., linear MDPs), here we need to learn the representation for the low-rank MDP. We study both the online RL and offline RL settings. For the online setting, operating with the same computational oracles used in FLAMBE (Agarwal et al., 2020b), the state-of-the-art algorithm for learning representations in low-rank MDPs, we propose an algorithm, REP-UCB (Upper Confidence Bound driven REPresentation learning for RL), which significantly improves on FLAMBE's sample complexity of O(A^9 d^7 / (ε^10 poly(1/(1−γ)))), with d being the rank of the transition matrix (or the dimension of the ground-truth representation), A the number of actions, ε the target accuracy, and γ the discount factor. Notably, REP-UCB is simpler than FLAMBE, as it directly balances the interplay between representation learning, exploration, and exploitation, while FLAMBE is an explore-then-commit style approach that has to perform reward-free exploration step by step forward in time. For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition: our algorithm is able to compete against any policy as long as it is covered by the offline data distribution.
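As a concrete illustration of the UCB-driven ingredient, below is a minimal sketch of the elliptical exploration bonus typically computed from a learned low-rank representation phi(s, a). The representation, visitation counts, and constants are placeholder assumptions for a small discrete problem; this is a sketch of the general mechanism, not REP-UCB itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned representation: phi[s, a] is a d-dimensional feature vector
# standing in for the representation that would be fit from data; drawn at random
# purely for illustration.
n_states, n_actions, d = 5, 3, 4
phi = rng.normal(size=(n_states, n_actions, d))

# Assumed visitation counts of (s, a) pairs in the data collected so far.
visits = rng.integers(0, 20, size=(n_states, n_actions))

# Regularized empirical feature covariance: Sigma = sum over data of phi phi^T + lam * I.
lam, alpha = 1.0, 0.5
Sigma = lam * np.eye(d)
for s in range(n_states):
    for a in range(n_actions):
        Sigma += visits[s, a] * np.outer(phi[s, a], phi[s, a])

# Elliptical exploration bonus alpha * sqrt(phi^T Sigma^{-1} phi): large for (s, a)
# whose feature directions are poorly covered by the data, small otherwise.
Sigma_inv = np.linalg.inv(Sigma)
bonus = alpha * np.sqrt(np.einsum("sad,de,sae->sa", phi, Sigma_inv, phi))
print(bonus)
```

Adding such a bonus to estimated rewards rewards visiting feature directions the data has not yet covered, which is how exploration is balanced against exploitation as the representation is refined.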
We study model-based offline reinforcement learning with general function approximation. We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO), which leverages a general function class and uses a constraint to encode pessimism. Under the assumption that the ground truth model belongs to our function class, CPPO can learn with offline data providing only partial coverage; i.e., it can learn a policy that competes against any policy covered by the offline data, with polynomial sample complexity with respect to the statistical complexity of the function class. We then demonstrate that this algorithmic framework can be applied to many specialized Markov Decision Processes where additional structural assumptions further refine the concept of partial coverage. One notable example is the low-rank MDP with representation learning, where partial coverage is defined via the relative condition number measured with respect to the underlying unknown ground-truth feature representation. Finally, we introduce and study the Bayesian setting in offline RL. The key benefit of Bayesian offline RL is that, algorithmically, we do not need to explicitly construct pessimism or a reward penalty, which can be hard beyond models with linear structure. We present a posterior-sampling-based incremental policy optimization algorithm (PS-PO), which proceeds by iteratively sampling a model from the posterior distribution and performing one incremental policy-optimization step inside the sampled model. Theoretically, in expectation with respect to the prior distribution, PS-PO can learn a near-optimal policy under partial coverage with polynomial sample complexity.
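The posterior-sampling loop can be pictured with a small tabular sketch: repeatedly sample a model from the posterior implied by the offline data, then take one incremental (soft) policy-optimization step inside the sampled model. The Dirichlet posterior, known rewards, step size, and exponentiated-gradient update below are illustrative assumptions, not the exact procedure analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small tabular MDP; rewards assumed known, transition rows given a Dirichlet posterior
# built from (assumed) offline transition counts.
n_states, n_actions, gamma, eta = 4, 2, 0.9, 1.0
R = rng.uniform(size=(n_states, n_actions))
counts = rng.integers(0, 10, size=(n_states, n_actions, n_states))
policy = np.full((n_states, n_actions), 1.0 / n_actions)

def q_values(P, policy):
    # Q^pi in the sampled model, from the linear Bellman system (I - gamma * P_pi) Q = R.
    P_pi = np.einsum("saz,zb->sazb", P, policy).reshape(n_states * n_actions, -1)
    q = np.linalg.solve(np.eye(n_states * n_actions) - gamma * P_pi, R.reshape(-1))
    return q.reshape(n_states, n_actions)

for t in range(50):
    # 1) Sample a plausible model from the posterior over transition dynamics.
    P_sample = np.array([[rng.dirichlet(1 + counts[s, a]) for a in range(n_actions)]
                         for s in range(n_states)])
    # 2) One incremental (exponentiated-gradient) policy update inside the sampled model.
    Q = q_values(P_sample, policy)
    policy = policy * np.exp(eta * Q)
    policy /= policy.sum(axis=1, keepdims=True)

print(policy)  # final stochastic policy after the posterior-sampling iterations
```

Because each update is taken in a freshly sampled model, no explicit pessimism bonus or reward penalty needs to be constructed.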
We consider the efficient estimation of a low-dimensional parameter in the presence of very high-dimensional nuisances that may depend on the parameter of interest. An important example is the quantile treatment effect (QTE) in causal inference, where the efficient estimating equation involves as a nuisance the conditional cumulative distribution function evaluated at the quantile to be estimated. Debiased machine learning (DML) is a data-splitting approach to address the need to estimate nuisances using flexible machine learning methods that may not satisfy strong metric entropy conditions, but applying it to problems with estimand-dependent nuisances would require estimating too many nuisances to be practical. For QTE estimation, DML requires that we learn the whole conditional cumulative distribution function, which may be challenging in practice and stands in contrast to the efficient estimation of average treatment effects, which needs only two regression functions. Instead, we propose localized debiased machine learning (LDML), a new three-way data-splitting approach that avoids this burdensome step and needs to estimate the nuisances only at a single initial bad guess for the parameters. In particular, under a Fréchet-derivative orthogonality condition, we show that the oracle estimating equation is asymptotically equivalent to one where the nuisance is evaluated at the true parameter value, and we provide a strategy to target this alternative formulation: construct an initial bad guess for the estimand using one third of the data, estimate the nuisances at this value using flexible machine learning methods on another third of the data, plug in these estimates and solve the estimating equation on the last third of the data, repeat with the thirds permuted, and average the solutions. In the case of QTE estimation, this involves learning only two binary regression models, for which many standard, time-tested machine learning methods exist. We prove that under lax rate conditions our estimator has the same favorable asymptotic behavior as the infeasible oracle estimator that solves the estimating equation with the true nuisance functions. Thus, our proposed approach uniquely enables practically feasible efficient estimation of important quantities in causal inference and other missing-data settings, such as QTEs.
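To make the three-way splitting concrete, here is a minimal sketch for the quantile of the treated potential outcome, with synthetic data and plain logistic regressions standing in for the two binary regressions (any flexible classifier could be substituted). It follows the split, estimate, solve, permute, and average recipe described above, using a standard AIPW-style quantile estimating equation; it is not the paper's exact implementation or tuning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data, assumed for illustration: X confounders, A binary treatment, Y outcome.
n, tau = 3000, 0.5
X = rng.normal(size=(n, 3))
A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
Y = X @ np.array([1.0, 0.5, -0.5]) + A + rng.normal(size=n)

folds = np.array_split(rng.permutation(n), 3)
estimates = []
for k in range(3):
    i1, i2, i3 = folds[k], folds[(k + 1) % 3], folds[(k + 2) % 3]

    # Step 1: crude initial guess for the quantile on the first fold
    # (here simply the raw sample quantile among treated units).
    theta0 = np.quantile(Y[i1][A[i1] == 1], tau)

    # Step 2: on the second fold, fit the two binary regressions: the propensity
    # score P(A=1 | X) and the conditional CDF P(Y <= theta0 | A=1, X), evaluated
    # only at the initial guess.  Any flexible classifier could be swapped in here.
    prop = LogisticRegression().fit(X[i2], A[i2])
    treated = i2[A[i2] == 1]
    cdf = LogisticRegression().fit(X[treated], (Y[treated] <= theta0).astype(int))

    # Step 3: on the third fold, plug the nuisance estimates into the estimating
    # equation and solve for theta (the moment is nondecreasing in theta).
    e3 = prop.predict_proba(X[i3])[:, 1]
    F3 = cdf.predict_proba(X[i3])[:, 1]
    def moment(theta):
        return np.mean(A[i3] / e3 * ((Y[i3] <= theta) - F3) + F3) - tau
    grid = np.sort(Y[i3])
    vals = np.array([moment(t) for t in grid])
    estimates.append(grid[min(np.searchsorted(vals, 0.0), len(grid) - 1)])

print(np.mean(estimates))  # average over the three fold rotations
```

Note that only two classification fits are required per rotation, instead of estimating the conditional CDF over its whole support.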
We study causal inference when not all confounders are observed but negative controls are available. Recent work has shown how negative controls can enable identification and efficient estimation of average treatment effects via two so-called bridge functions. In this paper, we consider a generalized average causal effect (GACE) with general interventions (discrete or continuous) and tackle the central challenge to causal inference using negative controls: the identification and estimation of the two bridge functions. Previous work has largely relied on completeness assumptions for identification and uniqueness assumptions for estimation, and has mainly focused on estimating these functions parametrically. We provide a new identification strategy for GACE that avoids completeness, and we propose new minimax-learning estimators for the (nonunique) bridge functions that can accommodate general function classes such as reproducing kernel Hilbert spaces and neural networks and can provide theoretical guarantees even when the bridge functions are nonunique. We establish finite-sample convergence results both for estimating the bridge functions themselves and for the final GACE estimator, under a variety of combinations of assumptions on the hypothesis and critic classes employed in the minimax estimator. Depending on how much we are willing to assume, we obtain different convergence rates. In some cases, we show that the GACE estimator may converge to the truth even when our minimax bridge-function estimators do not converge to any valid bridge function; in other cases, we show we can obtain semiparametric efficiency.
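A kernelized minimax estimator of an outcome bridge function has a particularly simple form when the critic class is an RKHS, because the inner maximization can be carried out in closed form, leaving a generalized ridge problem for the hypothesis coefficients. The sketch below uses synthetic proximal data, a linear hypothesis class, and an RBF critic kernel; all of these choices, including the regularization, are illustrative assumptions rather than the estimators or guarantees from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic proximal setup, assumed for illustration: U is an unobserved confounder,
# Z a negative-control exposure, W a negative-control outcome, A the treatment, Y the outcome.
n = 500
U = rng.normal(size=n)
Z = U + rng.normal(scale=0.5, size=n)
W = U + rng.normal(scale=0.5, size=n)
A = rng.binomial(1, 1 / (1 + np.exp(-U)))
Y = 2.0 * A + U + rng.normal(scale=0.5, size=n)

def rbf_kernel(V1, V2, bandwidth=1.0):
    d2 = ((V1[:, None, :] - V2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

# Critic class: an RKHS of functions of (Z, A).  Hypothesis class for the outcome
# bridge h(W, A): linear in simple features of (W, A) (illustrative choices only).
Kf = rbf_kernel(np.column_stack([Z, A]), np.column_stack([Z, A]))
Phi = np.column_stack([np.ones(n), W, A, W * A])

# With an RKHS critic, the inner maximization of the minimax objective is available
# in closed form, leaving a generalized ridge problem in the hypothesis coefficients:
#   min_beta  (Y - Phi beta)^T Kf (Y - Phi beta) / (2 n^2)  +  lam * ||beta||^2
lam = 1e-3
G = Phi.T @ Kf @ Phi / (2 * n ** 2) + lam * np.eye(Phi.shape[1])
beta = np.linalg.solve(G, Phi.T @ Kf @ Y / (2 * n ** 2))

# Plug the fitted bridge into the identification formula E[h(W, 1) - h(W, 0)] to get
# an estimate of the average effect of A (the true effect in this simulation is 2.0).
h1 = np.column_stack([np.ones(n), W, np.ones(n), W])
h0 = np.column_stack([np.ones(n), W, np.zeros(n), np.zeros(n)])
print(np.mean((h1 - h0) @ beta))
```

The critic kernel penalizes residuals that remain predictable from the negative-control exposure, which is the conditional moment restriction that defines the bridge function.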