2009
DOI: 10.1287/moor.1090.0396

Online Markov Decision Processes

Abstract: We consider a Markov decision process (MDP) setting in which the reward function is allowed to change after each time step (possibly in an adversarial manner), yet the dynamics remain fixed. As in the experts setting, we address the question of how well an agent can do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of the state space. Instead, these bounds depend only on a certain horizon…
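To make the setting concrete, here is a minimal sketch of the model the abstract describes: fixed transition dynamics, a reward function that may change arbitrarily at every step, and regret measured against the best stationary policy in hindsight. The two-state/two-action MDP, the random reward sequence, and the per-state Hedge (multiplicative-weights) learner are illustrative assumptions, not the paper's exact algorithm or bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, T = 2, 2, 5000
eta = np.sqrt(np.log(n_actions) / T)  # Hedge learning rate (assumed)

# Fixed dynamics: P[s, a] is a distribution over next states.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])

# Reward functions r_t(s, a) in [0, 1], chosen independently of the learner
# (standing in for an adversarially changing reward sequence).
rewards = rng.random((T, n_states, n_actions))

# One Hedge learner per state ("an expert algorithm in every state").
weights = np.ones((n_states, n_actions))

state = 0
online_return = 0.0
for t in range(T):
    probs = weights[state] / weights[state].sum()
    action = rng.choice(n_actions, p=probs)
    online_return += rewards[t, state, action]
    # Full-information feedback: every state's learner observes r_t.
    weights *= np.exp(eta * rewards[t])
    state = rng.choice(n_states, p=P[state, action])

# Return of the best stationary deterministic policy in hindsight,
# obtained by re-simulating the fixed dynamics under each policy.
best_return = -np.inf
for a0 in range(n_actions):
    for a1 in range(n_actions):
        policy = (a0, a1)
        sim = np.random.default_rng(0)
        s, total = 0, 0.0
        for t in range(T):
            a = policy[s]
            total += rewards[t, s, a]
            s = sim.choice(n_states, p=P[s, a])
        best_return = max(best_return, total)

print(f"online return: {online_return:.1f}")
print(f"best stationary policy in hindsight: {best_return:.1f}")
print(f"average regret per step: {(best_return - online_return) / T:.4f}")
```

Note that the regret here is measured against the reward the best stationary policy would have collected along its own trajectory, which is the comparison the abstract refers to; the bounds in the paper additionally depend on how fast the chain mixes, which this toy simulation does not capture.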

Cited by 139 publications (288 citation statements) · References 18 publications
“…Our setting might seem reminiscent of (online) reinforcement learning models, and in particular of online Markov Decision Processes (MDPs) (Even-Dar et al., 2009; Neu et al., 2010). In these models, there is typically a finite number of states, and the player's actions on each round cause him to transition from one state to another.…”
Section: Discussion and Related Work
confidence: 99%
“…EA-EMT is unlike other adaptive control algorithms based on expert ensembles, where experts directly produce actions or plans to be fused (e.g. [27,28,16]). Rather, EA-EMT operates in two distinct modules: the expert-based model estimation and a control algorithm that utilises that model.…”
Section: Discussion
confidence: 99%
“…The main lesson here is that off-line planning in the worst case can scale exponentially with the dimensionality of the state space (Chow and Tsitsiklis, 1989), while online planning (i.e., planning for the "current state") can break the curse of dimensionality by amortizing the planning effort over multiple time steps (Rust, 1996; Szepesvári, 2001). Other topics of interest include the linear programming-based approaches (de Farias and Van Roy, 2003, 2006), dual dynamic programming (Wang et al., 2008), techniques based on sample average approximation (Shapiro, 2003) such as PEGASUS (Ng and Jordan, 2000), online learning in MDPs with arbitrary reward processes (Even-Dar et al., 2005; Neu et al., 2010), and learning with (almost) no restrictions in a competitive framework (Hutter, 2004). Other important topics include learning and acting in partially observed MDPs (for recent developments, see, e.g., Littman et al., 2001; Toussaint et al., 2008), learning and acting in games or under other optimization criteria (Littman, 1994; Heger, 1994; Szepesvári and Littman, 1999; Borkar and Meyn, 2002), and the development of hierarchical and multi-time-scale methods (Dietterich, 1998; Sutton et al., 1999b).…”
Section: Further Reading
confidence: 99%