2009 International Conference on Game Theory for Networks
DOI: 10.1109/gamenets.2009.5137416

Online learning in Markov decision processes with arbitrarily changing rewards and transitions

Abstract: We consider decision-making problems in Markov decision processes where both the rewards and the transition probabilities vary in an arbitrary (e.g., non-stationary) fashion. We present algorithms that combine online learning and robust control, and establish guarantees on their performance evaluated in retrospect against alternative policies, i.e., their regret. These guarantees depend critically on the range of uncertainty in the transition probabilities, but hold regardless of the changes in rewards and transitions.
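
Below is one common way to write down the regret criterion the abstract refers to; the notation (per-step rewards r_t, transition kernels P_t, comparator policy π) is assumed for illustration and may differ from the paper's exact definition.

```latex
% A sketch of the regret criterion, under assumed notation:
% rewards r_t and kernels P_t may change arbitrarily with t.
\[
  \mathrm{Regret}_T \;=\; \max_{\pi \in \Pi}\,
      \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\bigl(s_t^{\pi}, \pi(s_t^{\pi})\bigr)\right]
    \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t(s_t, a_t)\right]
\]
% Here (s_t, a_t) is the learner's trajectory and s_t^{\pi} the trajectory the
% comparator policy \pi would have generated under the same kernels P_t.
```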

Cited by 20 publications (13 citation statements) · References 23 publications

“…We also mention here that Yu and Mannor [20,21] considered the related problem of online learning in MDPs where the transition probabilities may also change arbitrarily after each transition. This problem is significantly more difficult than the case where only the reward function is allowed to change.…”
Section: Introduction
Confidence: 99%
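
For context, here is a minimal formal sketch of the interaction protocol this excerpt describes, using assumed notation (state s_t, action a_t, per-step kernel P_t and reward r_t); it is illustrative, not the papers' exact formulation.

```latex
% Online learning in an MDP whose dynamics change after every transition
% (assumed notation, for illustration only):
%   at each step t the learner observes s_t, plays a_t, receives r_t(s_t, a_t),
%   and the next state is drawn from a kernel that may itself change each step:
\[
  s_{t+1} \sim P_t(\cdot \mid s_t, a_t), \qquad t = 1, 2, \ldots
\]
% One intuition for the added difficulty: with a fixed P, a comparator policy's
% state distribution is determined by P alone; with changing P_t it depends on
% the entire kernel sequence.
```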
“…In particular, at each time step n ∈ {1, 2, ...}, the server picks a caching action a_n from the action space A given the current state of the system g_n ∈ G, which is the state space. Given the current action and state pair (g_n, a_n), the server moves to some state g′ with probability Pr(g′ | g_n, a_n) and receives a reward r_n(g_n, a_n) [28]. In the following, the action space, states, transition probabilities, and reward function are defined.…”
Section: System Model
Confidence: 99%
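
As a concrete illustration of the interaction loop in this excerpt, here is a minimal runnable sketch; the state space G, action space A, transition kernel, and reward below are placeholders chosen for illustration, not the model from the cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)

num_states, num_actions = 4, 2          # assumed small sizes for illustration
G = np.arange(num_states)               # state space G (placeholder)
A = np.arange(num_actions)              # caching action space A (placeholder)

def transition_probs(n, g, a):
    """Stand-in for Pr(. | g_n, a_n); may change with the time step n."""
    logits = rng.random(num_states)
    return logits / logits.sum()

def reward(n, g, a):
    """Stand-in for the time-varying reward r_n(g_n, a_n)."""
    return float(rng.random())

g = 0                                   # initial system state
for n in range(1, 6):                   # time steps n = 1, 2, ...
    a = rng.choice(A)                   # server picks a caching action a_n (policy omitted)
    r = reward(n, g, a)                 # collects reward r_n(g_n, a_n)
    g_next = rng.choice(G, p=transition_probs(n, g, a))  # moves to g' ~ Pr(. | g_n, a_n)
    print(f"n={n} state={g} action={a} reward={r:.3f} next_state={g_next}")
    g = g_next
```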
“…Non-stationarity has attracted interest in the multi-armed bandit (MAB) and RL literature. Prior RL works typically deal with general non-stationary environments, e.g., (Goldberg and Matarić 2003; da Silva et al. 2006; Yu and Mannor 2009; Al-Shedivat et al. 2018). A relevant approach for efficient policy adaptation is to meta-learn the changes in the reward and update the policy accordingly (Al-Shedivat et al. 2018).…”
Section: Non-stationary Rewards
Confidence: 99%