Deep Reinforcement Learning 2020
DOI: 10.1007/978-981-15-4095-0_7
Challenges of Reinforcement Learning

Cited by 65 publications (76 citation statements)
References 14 publications
“…RL methods suffer from scalability issues because of the large number of states [29]. We evaluate the scalability of our algorithms for different numbers of layers in the network topology shown in Figure 3.…”
Section: G. Scalability (mentioning)
Confidence: 99%
“…In recent years, various off-policy RL algorithms have been successfully applied and have shown significant performance improvements on challenging tasks, from classic Atari games and Go [1]-[5] to robotic control environments such as MuJoCo [6]-[13] and real-world implementations of robotic control [9]. However, two major challenges remain in off-policy RL: exploration of a large state space and efficient utilization of the stored experiences [14], [15]. Exploration focuses on how to make an RL agent encounter new and diverse experiences, while experience utilization addresses how the agent can extract full knowledge from the experiences it has already stored.…”
Section: Introduction (mentioning)
Confidence: 99%
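To make the exploration challenge concrete, here is a minimal sketch of epsilon-greedy action selection, one standard way to push an agent toward new experiences; it is illustrative only and not drawn from the cited works, and the function name and parameters are our own.

```python
import random

def epsilon_greedy_action(q_values, epsilon=0.1):
    """Choose an action index from estimated action values.

    With probability `epsilon` the agent explores (random action); otherwise
    it exploits the current value estimates. In practice `epsilon` is often
    annealed from ~1.0 toward a small constant as training progresses.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit
```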
“…This sequential process is called experience replay (ER) [16], which is widely used in off-policy RL algorithms with a replay buffer (RB) large enough to store diverse experience samples over a wide interval. However, applying ER to off-policy RL still suffers from sampling inefficiency [14], [17]: although the agent needs to draw useful experience tuples from the RB throughout training to develop an optimal policy, sampling can be inefficient with conventional techniques such as uniform sampling, especially in the early stage of learning while the RB is still being filled. In fact, uniform sampling yields a relatively high sampling frequency for experience tuples stored early in the RB, because earlier tuples fall inside the sampling window many more times than later ones [18], [19].…”
Section: Introduction (mentioning)
Confidence: 99%
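A minimal replay buffer with uniform sampling makes this bias visible: a tuple stored at step 0 is eligible for nearly every sampling call during training, while a tuple stored late is eligible for only a few. This is a sketch under our own naming, not the implementation from the cited papers.

```python
import random
from collections import Counter, deque

class ReplayBuffer:
    """Minimal experience replay (ER) buffer with uniform sampling."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest tuples are evicted first

    def add(self, transition):
        # transition is a (state, action, reward, next_state, done, ...) tuple
        self.storage.append(transition)

    def sample(self, batch_size):
        # Uniform sampling: every stored tuple is equally likely *right now*,
        # but early tuples take part in far more sampling calls over training.
        return random.sample(list(self.storage), batch_size)


# Rough illustration of the bias toward early tuples while the buffer fills.
buffer = ReplayBuffer(capacity=5_000)
hits = Counter()
for step in range(5_000):
    buffer.add(("state", "action", 0.0, "next_state", False, step))
    if step >= 32:
        for transition in buffer.sample(32):
            hits[transition[-1]] += 1   # count draws per insertion step
print(f"tuple from step 0 drawn {hits[0]} times; "
      f"tuple from step 4000 drawn {hits[4000]} times")
```

Running the snippet shows the early tuple drawn roughly an order of magnitude more often than the late one, which is the over-sampling effect the excerpt describes.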
“…Hence, multiple policies, one correlated with each task in person-following robot development, must be obtained during the training process. Given the potential of deep reinforcement learning (DRL) [7], [8], the combination of deep learning (DL) and reinforcement learning (RL), for training and acquiring optimal robot policies, applying it to develop a person-following robot is a compelling option. With DRL, a specific policy, represented as a DL model, can be trained directly without first collecting enormous labeled datasets.…”
Section: Introduction (mentioning)
Confidence: 99%