2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)
DOI: 10.1109/adprl.2014.7010633

Pseudo-MDPs and factored linear action models

Abstract: In this paper we introduce the concept of pseudo-MDPs to develop abstractions. Pseudo-MDPs relax the requirement that the transition kernel has to be a probability kernel. We show that the new framework captures many existing abstractions. We also introduce the concept of factored linear action models, a special case of this framework. Again, the relation of factored linear action models to existing work is discussed. We use the general framework to develop a theory for bounding the suboptimality of policies derived …
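To make the relaxation concrete, here is a minimal sketch (the construction and all names are my own, not the paper's code): ordinary value iteration applied to a non-negative "transition kernel" whose rows deliberately do not sum to one. This is the kind of object a pseudo-MDP admits; the Bellman backup is unchanged and still contracts in the sup norm as long as the discount factor times the largest row sum stays below one.

```python
import numpy as np

# Minimal sketch (illustrative only): value iteration on a pseudo-MDP whose
# "transition kernels" are non-negative but do NOT have rows summing to one.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 2, 0.9

P = rng.random((n_actions, n_states, n_states))
P *= 0.95 / P.sum(axis=2, keepdims=True)      # rows sum to 0.95, not 1.0
r = rng.random((n_actions, n_states))

V = np.zeros(n_states)
for _ in range(500):
    # Standard Bellman optimality backup, exactly as in an ordinary MDP;
    # it contracts because gamma * 0.95 < 1.
    Q = r + gamma * np.einsum("axy,y->ax", P, V)
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

greedy_policy = Q.argmax(axis=0)
print(V, greedy_policy)
```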

Cited by 11 publications (22 citation statements)
References 15 publications (12 reference statements)
“…It estimates the value function directly from a sensed experience (Sutton and Barto 2018). On the other hand, the model-based RL approach uses an estimated transition function to compute the optimal policy (Yao and Szepesvári 2012; Yao et al. 2014; Sutton and Barto 2018; Moerland, Broekens, and Jonker 2020). A model-based RL method usually has a planning component, which learns and uses a model to approximate value functions.…”
Section: Focusing on Model-Based Reinforcement Learning (mentioning)
confidence: 99%
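For the planning component mentioned in this statement, here is a minimal sketch (my own illustration, not the algorithm of any cited paper; the function names are assumptions): a tabular model estimated from counted transitions, followed by value iteration on that estimated model to approximate the value function and a greedy policy.

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """transitions: list of (state, action, reward, next_state) tuples."""
    counts = np.zeros((n_actions, n_states, n_states))
    rewards = np.zeros((n_actions, n_states))
    visits = np.zeros((n_actions, n_states))
    for s, a, r, s_next in transitions:
        counts[a, s, s_next] += 1.0
        rewards[a, s] += r
        visits[a, s] += 1.0
    visits = np.maximum(visits, 1.0)          # unvisited pairs keep zero rows
    P_hat = counts / visits[:, :, None]       # estimated transition model
    r_hat = rewards / visits                  # estimated mean reward
    return P_hat, r_hat

def plan(P_hat, r_hat, gamma=0.95, n_iters=200):
    """Value iteration on the estimated model (the planning component)."""
    V = np.zeros(P_hat.shape[1])
    for _ in range(n_iters):
        Q = r_hat + gamma * np.einsum("axy,y->ax", P_hat, V)
        V = Q.max(axis=0)
    return Q.argmax(axis=0), V                # greedy policy and its values
```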
“…Online learning under this assumption has received substantial attention in the recent literature, and in particular has been shown to be satisfied in the class of so-called linear MDPs studied by Jin et al. [24], Cai et al. [14] and low-rank MDPs studied by Yang and Wang [44], which are both special cases of factored linear models [45, 34].…”
Section: Assumption 1 (Realizable Function Approximation) (mentioning)
confidence: 99%
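As a side note in my own notation (a sketch of the relationship, not a quotation from the cited works): a linear MDP posits transitions and rewards that are linear in a known feature map, and stacking those features into a matrix exposes the factored structure.

```latex
% Sketch (my notation): a linear MDP assumes, for a known feature map \varphi
% and unknown measures \mu_h and vector \theta_h,
\[
  P_h(x' \mid x, a) = \langle \varphi(x, a), \mu_h(x') \rangle,
  \qquad
  r_h(x, a) = \langle \varphi(x, a), \theta_h \rangle .
\]
% Stacking the \varphi(x,a)^{\top} as rows of a matrix \Phi and the \mu_h(x')
% as columns of a matrix M_h gives the factored form
\[
  P_h = \Phi M_h, \qquad r_h = \Phi \theta_h ,
\]
% i.e. the transition matrix factors through the feature map, which is the
% defining property of a factored linear model.
```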
“…This allows us to define the S × d feature matrix Φ with its x-th row being ϕ^T(x), and represent the action-value function as Q_{h,a} = Φ θ_{h,a}. We make the following assumption: Assumption 1 (Factored linear MDP [49, 37, 25]). For each action a and stage h, there exists a d × S matrix M_{h,a} and a vector ρ_a such that the transition matrix can be written as P_{h,a} = Φ M_{h,a}, and the reward function as r_a = Φ ρ_a.…”
Section: Linear Function Approximation in MDPs (mentioning)
confidence: 99%
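A minimal numerical sketch of that assumption (variable names such as Phi, M, and rho are mine, not the cited paper's code): when both the transition matrix and the reward factor through the feature matrix, a single Bellman backup keeps the action-value function inside the span of the features, which is the realizability used above. Note also that a product Φ M_{h,a} built this way need not have rows summing to one, which is where the pseudo-MDP relaxation is convenient.

```python
import numpy as np

# Sketch of the factored structure: P = Phi @ M and r = Phi @ rho imply that
# the backup Q = r + P @ V_next equals Phi @ (rho + M @ V_next), so Q lies in
# the span of the d feature columns.
rng = np.random.default_rng(1)
S, d = 30, 5

Phi = rng.random((S, d))     # S x d feature matrix, row x is phi(x)^T
M = rng.random((d, S))       # d x S factor for one (action, stage) pair
rho = rng.random(d)

P = Phi @ M                  # S x S transition matrix (rows need not sum to 1)
r = Phi @ rho                # S-dimensional reward vector
V_next = rng.random(S)       # any next-stage value function

Q_direct = r + P @ V_next            # standard one-step backup
theta = rho + M @ V_next             # the same backup as a d-dim parameter
Q_factored = Phi @ theta

assert np.allclose(Q_direct, Q_factored)
print("Q is representable as Phi @ theta with theta of dimension", d)
```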