2017
DOI: 10.1287/moor.2016.0826
Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization

Abstract: We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and, as a solution, propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q* lies within a known hypothesis class Q, OCP selects optimal actions over all but at most dim_E[Q] episodes, where dim_E denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees…
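To make the abstract's guarantee concrete, below is a minimal, hypothetical Python sketch of the optimism principle behind OCP in a finite-horizon deterministic system. Two simplifications to note: the hypothesis class Q is a finite list of tabular Q-functions, and "constraint propagation" degenerates to version-space elimination, whereas the paper's OCP propagates constraints through general (possibly infinite) classes. All names (ChainEnv, optimistic_episode, make_hypotheses) are illustrative, not from the paper.

```python
# Minimal sketch: optimistic action selection plus elimination of
# hypotheses inconsistent with observed deterministic transitions.

class ChainEnv:
    """Deterministic chain with states 0..H and actions {0, 1}: action 1
    moves right, action 0 stays; reward 1 only on entering state H."""

    actions = (0, 1)

    def __init__(self, horizon):
        self.horizon = horizon

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        nxt = self.state + 1 if action == 1 else self.state
        reward = 1.0 if nxt == self.horizon else 0.0
        self.state = nxt
        return nxt, reward


def true_q(horizon):
    """Tabular optimal Q: value 1 iff we are on schedule (s == t) and keep
    moving right (a == 1); otherwise state H is no longer reachable."""
    return {(t, s, a): 1.0 if (s == t and a == 1) else 0.0
            for t in range(horizon) for s in range(t + 1) for a in (0, 1)}


def make_hypotheses(horizon):
    """The true Q plus one corrupted copy per step, each falsely optimistic
    about staying put -- a toy stand-in for a richer hypothesis class."""
    hypotheses = [true_q(horizon)]
    for t in range(horizon):
        wrong = dict(true_q(horizon))
        wrong[(t, t, 0)] = 2.0
        hypotheses.append(wrong)
    return hypotheses


def optimistic_episode(env, hypotheses, horizon):
    """Act greedily w.r.t. the most optimistic surviving hypothesis, then
    prune hypotheses violating the observed deterministic Bellman
    constraints: Q(t,s,a) = r + max_a' Q(t+1,s',a'), and Q = r at t = H-1."""
    s, trajectory = env.reset(), []
    for t in range(horizon):
        # Optimistic choice: highest value promised by any surviving hypothesis.
        a = max(env.actions,
                key=lambda act: max(q[(t, s, act)] for q in hypotheses))
        s_next, r = env.step(a)          # deterministic transition
        trajectory.append((t, s, a, r, s_next))
        s = s_next

    def consistent(q):
        for t, s, a, r, s_next in trajectory:
            target = r if t == horizon - 1 else \
                r + max(q[(t + 1, s_next, a2)] for a2 in env.actions)
            if abs(q[(t, s, a)] - target) > 1e-9:
                return False
        return True

    return [q for q in hypotheses if consistent(q)]


if __name__ == "__main__":
    H = 4
    env = ChainEnv(H)
    hypotheses = make_hypotheses(H)
    episodes = 0
    while len(hypotheses) > 1:
        hypotheses = optimistic_episode(env, hypotheses, H)
        episodes += 1
    print(f"isolated the true Q-function after {episodes} episodes")
```

Roughly speaking, every suboptimal episode here discards at least one hypothesis, so with a finite class at most |Q| - 1 episodes can be suboptimal; the eluder dimension dim_E[Q] in the paper's bound generalizes this counting argument beyond finite enumeration.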

Citations: Cited by 25 publications (43 citation statements)
References: 25 publications (35 reference statements)
“…These assumptions remain much stronger than the realizability assumption considered herein, where only the optimal Q-function Q* is assumed to be linearly representable. Wen and Van Roy (2017) showed that sample-efficient RL is feasible in deterministic systems, which has been extended to stochastic systems with low variance in Du et al. (2020b) under additional gap assumptions. In addition, Weisz et al. (2021b) established exponential sample complexity lower bounds under the generative model when only Q* is linearly realizable; their construction critically relied on making the action set exponentially large.…”
Section: Additional Related Work
Mentioning confidence: 96%
“…This research was further extended to kernel and neural function approximation in recent work (Wang et al., 2020). Other approaches in this approximation setting are either computationally intractable (Krishnamurthy et al., 2016; Dann et al., 2018; Dong et al., 2020) or require strong assumptions on the transition model (Wen & Van Roy, 2017).…”
Section: Related Work
Mentioning confidence: 99%
“…There has been substantial recent theoretical interest in understanding the means by which we can avoid the curse of dimensionality and obtain sample-efficient reinforcement learning (RL) methods [Wen and Van Roy, 2017, Du et al., 2019c,b, Wang et al., 2019, Yang and Wang, 2019, Cai et al., 2020, Zanette et al., 2020, Zhou et al., 2020b,a, Modi et al., 2020, Ayoub et al., 2020]. Here, the extant body of literature largely focuses on sufficient conditions for efficient reinforcement learning.…”
Section: Introduction
Mentioning confidence: 99%