2010
DOI: 10.1109/tip.2009.2035228

Online Reinforcement Learning for Dynamic Multimedia Systems

Abstract: In our previous work, we proposed a systematic cross-layer framework for dynamic multimedia systems, which allows each layer to make autonomous and foresighted decisions that maximize the system's long-term performance while meeting the application's real-time delay constraints. The proposed solution solved the cross-layer optimization offline, under the assumption that the multimedia system's probabilistic dynamics were known a priori, by modeling the system as a layered Markov decision process. In practice,…
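For context, the offline approach the abstract describes amounts to solving an MDP with known dynamics, e.g., by value iteration. The sketch below is a generic illustration only, not the paper's layered formulation; the transition tensor P[s, a, s'], reward matrix R[s, a], and discount gamma are assumed inputs.

```python
import numpy as np

# Generic value iteration for an MDP with known dynamics: the kind of offline
# solution the abstract contrasts with online learning. P, R, and gamma are
# assumed inputs; the layered structure of the actual framework is not modeled.

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * P @ V            # shape: (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and greedy policy
        V = V_new
```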

Cited by 20 publications (32 citation statements) | References 28 publications
“…Sarsa($\lambda$) is an eligibility-trace [21] version of Sarsa. The update of $Q^\pi(s,a)$ [22] depends on $Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t e_t(s,a)$, where $\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$ and
$$e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 & \text{if } s = s_t,\ a = a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise.} \end{cases}$$ …”
Section: Temporal Difference Learning
confidence: 99%
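The quoted update rule maps directly onto a tabular implementation. Below is a minimal Sarsa(λ) sketch with accumulating traces; the environment interface (env.reset(), env.step() returning next state, reward, done) and all hyperparameter values are assumptions for illustration.

```python
import numpy as np

# Minimal tabular Sarsa(lambda) with accumulating eligibility traces,
# following the update quoted above:
#   Q_{t+1}(s,a) = Q_t(s,a) + alpha * delta_t * e_t(s,a)
#   delta_t      = r_{t+1} + gamma * Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)
# The environment interface and hyperparameters are assumptions.

def sarsa_lambda(env, n_states, n_actions, episodes=500,
                 alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)

    def epsilon_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        e = np.zeros_like(Q)                # eligibility traces, reset per episode
        s = env.reset()                     # assumed: returns an integer state
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)   # assumed: (next state, reward, done)
            a_next = epsilon_greedy(s_next)
            # On-policy TD error using the action actually selected next
            delta = r + (0.0 if done else gamma * Q[s_next, a_next]) - Q[s, a]
            e *= gamma * lam                # decay: gamma * lambda * e_{t-1}(s, a)
            e[s, a] += 1.0                  # accumulate at the visited pair
            Q += alpha * delta * e          # update all (s, a) weighted by traces
            s, a = s_next, a_next
    return Q
```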
“…PDS learning uses a sample average of the PDS value function to approximate it, given the experience tuple in each time slot (Eq. (9)), where the learning rate parameter is time-varying.…”
Section: The Post-decision State Learning Algorithm
confidence: 99%
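To make the sample-average update concrete, here is a hedged sketch of one PDS learning step. The finite PDS space, the known deterministic mapping pds_of(s, a), the known reward component r_known, and the 1/visits learning-rate schedule are all illustrative assumptions, not details taken from the cited paper.

```python
# One step of post-decision state (PDS) learning, as a hedged sketch.
# Assumptions (not from the source): a finite PDS space; a known deterministic
# mapping pds_of(s, a) from state-action pairs to the post-decision state; a
# known reward component r_known(s, a); and alpha = 1 / visits, which makes the
# update an incremental sample average, as the citation suggests.

def pds_update(V_pds, experience, pds_of, r_known, actions, gamma, visits):
    """Update V_pds from the experience tuple (s, a, r_unknown, s_next)."""
    s, a, r_unknown, s_next = experience
    ps = pds_of(s, a)                       # post-decision state reached from (s, a)
    visits[ps] += 1
    alpha = 1.0 / visits[ps]                # time-varying learning rate
    # Greedy lookahead from the next state uses only known quantities plus
    # V_pds, so no expectation over unknown dynamics is required.
    next_value = max(r_known(s_next, a2) + V_pds[pds_of(s_next, a2)]
                     for a2 in actions)
    target = r_unknown + gamma * next_value
    # Sample-average update of the PDS value function
    V_pds[ps] = (1.0 - alpha) * V_pds[ps] + alpha * target
    return V_pds
```

Because one observed transition updates the value of a post-decision state that many predecision states map into, a single update propagates information to the state-value function at many states, which is the advantage the next citation statement points out.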
“…in (8). Second, updating a single PDS using (9) provides information about the state-value function at many states. This is evident from the expected PDS value function on the right-hand side of (8).…”
Section: The Post-decision State Learning Algorithm
confidence: 99%
“…Online machine-learning-based power management has recently been practiced [26], [27], [28], [29], [30], [31]. For example, Q-learning [32], [33] can be utilized to find an optimal action-selection policy over a set of states.…”
confidence: 99%
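A minimal tabular Q-learning sketch of the kind this citation refers to is shown below. The epsilon-greedy exploration, the environment interface, and the power-management framing in the comments are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of tabular Q-learning, which the citing paper notes can find an
# optimal action-selection policy over a set of states. The power-management
# framing (e.g., states = workload levels, actions = power modes) and all names
# here are assumptions for illustration.

def q_learning(env, n_states, n_actions, steps=10_000,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    s = env.reset()                          # assumed environment interface
    for _ in range(steps):
        # epsilon-greedy exploration
        a = int(rng.integers(n_actions)) if rng.random() < epsilon \
            else int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)        # assumed: (next state, reward, done)
        # Off-policy target: max over next-state actions
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q
```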