2010
DOI: 10.1109/tip.2009.2035228

Online Reinforcement Learning for Dynamic Multimedia Systems

Abstract: In our previous work, we proposed a systematic cross-layer framework for dynamic multimedia systems, which allows each layer to make autonomous and foresighted decisions that maximize the system's long-term performance while meeting the application's real-time delay constraints. The proposed solution solved the cross-layer optimization offline, under the assumption that the multimedia system's probabilistic dynamics were known a priori, by modeling the system as a layered Markov decision process. In practice,…
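For context, the offline approach the abstract describes amounts to solving an MDP with known dynamics, e.g., by value iteration. The sketch below is a generic illustration only, not the paper's layered formulation; the transition tensor P[s, a, s'], reward matrix R[s, a], and discount gamma are assumed inputs.

```python
import numpy as np

# Generic value iteration for an MDP with known dynamics: the kind of offline
# solution the abstract contrasts with online learning. P, R, and gamma are
# assumed inputs; the layered structure of the actual framework is not modeled.

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * P @ V            # shape: (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and greedy policy
        V = V_new
```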

Cited by 20 publications (32 citation statements) | References 28 publications
“…Sarsa($\lambda$) is an eligibility-trace [21] version of Sarsa. The update of $Q^\pi(s,a)$ [22] depends on $Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t e_t(s,a)$, where $\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$ and
$$e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 & \text{if } s = s_t,\ a = a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise.} \end{cases}$$ …”
Section: Temporal Difference Learning
confidence: 99%
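The quoted update rule maps directly onto a tabular implementation. Below is a minimal Sarsa(λ) sketch with accumulating traces; the environment interface (env.reset(), env.step() returning next state, reward, done) and all hyperparameter values are assumptions for illustration.

```python
import numpy as np

# Minimal tabular Sarsa(lambda) with accumulating eligibility traces,
# following the update quoted above:
#   Q_{t+1}(s,a) = Q_t(s,a) + alpha * delta_t * e_t(s,a)
#   delta_t      = r_{t+1} + gamma * Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)
# The environment interface and hyperparameters are assumptions.

def sarsa_lambda(env, n_states, n_actions, episodes=500,
                 alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)

    def epsilon_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        e = np.zeros_like(Q)                # eligibility traces, reset per episode
        s = env.reset()                     # assumed: returns an integer state
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)   # assumed: (next state, reward, done)
            a_next = epsilon_greedy(s_next)
            # On-policy TD error using the action actually selected next
            delta = r + (0.0 if done else gamma * Q[s_next, a_next]) - Q[s, a]
            e *= gamma * lam                # decay: gamma * lambda * e_{t-1}(s, a)
            e[s, a] += 1.0                  # accumulate at the visited pair
            Q += alpha * delta * e          # update all (s, a) weighted by traces
            s, a = s_next, a_next
    return Q
```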
“…PDS learning uses a sample average of the PDS value function to approximate it, given the experience tuple in each time slot (Eq. (9)), where the learning rate parameter is time-varying.…”
Section: The Post-decision State Learning Algorithm
confidence: 99%
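To make the sample-average update concrete, here is a hedged sketch of one PDS learning step. The finite PDS space, the known deterministic mapping pds_of(s, a), the known reward component r_known, and the 1/visits learning-rate schedule are all illustrative assumptions, not details taken from the cited paper.

```python
# One step of post-decision state (PDS) learning, as a hedged sketch.
# Assumptions (not from the source): a finite PDS space; a known deterministic
# mapping pds_of(s, a) from state-action pairs to the post-decision state; a
# known reward component r_known(s, a); and alpha = 1 / visits, which makes the
# update an incremental sample average, as the citation suggests.

def pds_update(V_pds, experience, pds_of, r_known, actions, gamma, visits):
    """Update V_pds from the experience tuple (s, a, r_unknown, s_next)."""
    s, a, r_unknown, s_next = experience
    ps = pds_of(s, a)                       # post-decision state reached from (s, a)
    visits[ps] += 1
    alpha = 1.0 / visits[ps]                # time-varying learning rate
    # Greedy lookahead from the next state uses only known quantities plus
    # V_pds, so no expectation over unknown dynamics is required.
    next_value = max(r_known(s_next, a2) + V_pds[pds_of(s_next, a2)]
                     for a2 in actions)
    target = r_unknown + gamma * next_value
    # Sample-average update of the PDS value function
    V_pds[ps] = (1.0 - alpha) * V_pds[ps] + alpha * target
    return V_pds
```

Because one observed transition updates the value of a post-decision state that many predecision states map into, a single update propagates information to the state-value function at many states, which is the advantage the next citation statement points out.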
“…in (8). Second, updating a single PDS using (9) provides information about the state-value function at many states. This is evident from the expected PDS value function on the right-hand side of (8).…”
Section: The Post-decision State Learning Algorithm
confidence: 99%
“…Online machine-learning-based power management has recently been practiced [26], [27], [28], [29], [30], [31]. For example, Q-learning [32], [33] can be utilized to find an optimal action-selection policy over a set of states.…”
confidence: 99%
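A minimal tabular Q-learning sketch of the kind this citation refers to is shown below. The epsilon-greedy exploration, the environment interface, and the power-management framing in the comments are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of tabular Q-learning, which the citing paper notes can find an
# optimal action-selection policy over a set of states. The power-management
# framing (e.g., states = workload levels, actions = power modes) and all names
# here are assumptions for illustration.

def q_learning(env, n_states, n_actions, steps=10_000,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    s = env.reset()                          # assumed environment interface
    for _ in range(steps):
        # epsilon-greedy exploration
        a = int(rng.integers(n_actions)) if rng.random() < epsilon \
            else int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)        # assumed: (next state, reward, done)
        # Off-policy target: max over next-state actions
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q
```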