2017 First IEEE International Conference on Robotic Computing (IRC)
DOI: 10.1109/irc.2017.33

Active Exploration and Parameterized Reinforcement Learning Applied to a Simulated Human-Robot Interaction Task

Abstract: Online model-free reinforcement learning (RL) methods with continuous actions play a prominent role in real-world applications such as robotics. However, when confronted with non-stationary environments, these methods crucially rely on an exploration-exploitation tradeoff that is rarely adjusted dynamically and automatically to changes in the environment. Here we propose an active exploration algorithm for RL in a structured (parameterized) continuous action space. This framework deals with …

Cited by 25 publications (18 citation statements)
References 20 publications
“…Masson et al [34] handled discrete actions with Q-learning and policy search for continuous actions. Similarly, Khamassi et al [35] use Q-learning and policy gradient to achieve the same results. Those methods assume on-policy learning and handle discrete and continuous actions separately.…”
Section: Related Work
confidence: 96%
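The statement above contrasts methods that treat the discrete action choice and its continuous parameter separately: value-based learning (Q-learning) for the discrete part and policy search or policy gradient for the continuous part. The following is a minimal sketch of that separation in a single-state, bandit-like setting; the class name, hyperparameters, and Gaussian policy form are illustrative assumptions, not taken from Masson et al. [34] or Khamassi et al. [35].

```python
import numpy as np

class ParameterizedActionAgent:
    """Sketch of a parameterized action space handled in two parts:
    Q-learning over discrete actions, plus a REINFORCE-style Gaussian
    policy gradient over each action's continuous parameter."""

    def __init__(self, n_actions, alpha_q=0.1, alpha_pi=0.01, beta=3.0, sigma=0.5):
        self.q = np.zeros(n_actions)    # value of each discrete action
        self.mu = np.zeros(n_actions)   # mean of each action's continuous parameter
        self.alpha_q, self.alpha_pi = alpha_q, alpha_pi
        self.beta, self.sigma = beta, sigma

    def act(self):
        # Discrete choice: Boltzmann (softmax) over Q-values.
        prefs = self.beta * self.q
        prefs -= prefs.max()
        probs = np.exp(prefs) / np.exp(prefs).sum()
        a = np.random.choice(len(self.q), p=probs)
        # Continuous parameter: sample from a Gaussian policy for action a.
        theta = np.random.normal(self.mu[a], self.sigma)
        return a, theta

    def learn(self, a, theta, reward):
        # Bandit-form Q-learning update for the discrete action.
        self.q[a] += self.alpha_q * (reward - self.q[a])
        # REINFORCE update of the parameter mean:
        # grad log pi(theta) = (theta - mu) / sigma^2 for a Gaussian policy.
        self.mu[a] += self.alpha_pi * reward * (theta - self.mu[a]) / self.sigma**2
```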
“…However, the training time for the epsilon-greedy strategy is proportional to the scale of the state space and action space [12] [13]. Another common method for exploration-exploitation is the Boltzmann exploration strategy [14] [15] [16]. The Boltzmann exploration strategy guides a robot to select an action with a probability that depends on the value function, while a temperature parameter restrains the randomness of action selection.…”
Section: The Exploration-Exploitation Dilemma in Obstacle Avoidance
confidence: 99%
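For concreteness, here is a minimal sketch of the two exploration strategies compared in the statement above: Boltzmann (softmax) selection, where the temperature controls how strongly action probabilities follow the value estimates, and an epsilon-greedy baseline. The function names and the bandit-style interface are illustrative assumptions.

```python
import numpy as np

def boltzmann_action(q_values, temperature):
    """Sample an action with probability proportional to exp(Q(a) / temperature).
    High temperature -> near-uniform exploration; low -> near-greedy exploitation."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                      # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(probs), p=probs)

def epsilon_greedy_action(q_values, epsilon):
    """Epsilon-greedy baseline: explore uniformly with probability epsilon."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))
```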
“…In previous work, we have applied this meta-learning principle in an algorithm here referred to as MLB to dynamically tune β_t in a simple multi-armed bandit scenario involving the interaction between a simulated human and a robot [10]. Formally, function F is a Boltzmann softmax with parameter φ = 0 (i.e.…”
Section: Problem Formulation and Algorithms
confidence: 99%
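The MLB idea referenced above is to meta-learn the inverse temperature β_t of the Boltzmann softmax online rather than fixing it by hand. The sketch below assumes a common form of this meta-learning rule: β_t is raised when a fast (mid-term) running average of reward exceeds a slow (long-term) one, and lowered otherwise. The exact update constants are illustrative, not taken from the cited paper.

```python
import numpy as np

class MLBTemperature:
    """Sketch of meta-learned inverse temperature beta_t, assuming an update
    that compares mid-term and long-term running averages of reward."""

    def __init__(self, beta0=1.0, tau_mid=0.1, tau_long=0.01, eta=1.0,
                 beta_min=0.0, beta_max=50.0):
        self.beta = beta0
        self.r_mid = 0.0       # fast (mid-term) reward average
        self.r_long = 0.0      # slow (long-term) reward average
        self.tau_mid, self.tau_long, self.eta = tau_mid, tau_long, eta
        self.beta_min, self.beta_max = beta_min, beta_max

    def update(self, reward):
        # Exponential moving averages over two time scales.
        self.r_mid += self.tau_mid * (reward - self.r_mid)
        self.r_long += self.tau_long * (reward - self.r_long)
        # Recent rewards above the long-term baseline -> exploit more (raise beta);
        # recent rewards below it -> explore more (lower beta).
        self.beta += self.eta * (self.r_mid - self.r_long)
        self.beta = float(np.clip(self.beta, self.beta_min, self.beta_max))
        return self.beta
```

The resulting β_t would then be used in the Boltzmann softmax, i.e. action probabilities proportional to exp(β_t · Q(a)).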
“…The mid- and long-term rewards are also calculated as in [8] and used to update the inverse temperature parameter β_t according to the update rule of MLB [10]. When the uncertainty of an arm's action value increases, the respective arm should be explored more.…”
Section: Hybrid Meta-Learning with Kalman Filters (MLB-KF)
confidence: 99%
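The hybrid MLB-KF variant described above tracks each arm's action value with a Kalman filter, so that the filter's posterior variance provides an explicit uncertainty signal, while β_t is still meta-learned from mid- and long-term rewards. The sketch below reuses the MLBTemperature class from the previous example; the specific way uncertainty modulates exploration here (shrinking the effective β when average uncertainty is high) is an illustrative coupling, not the exact MLB-KF rule, and `reward_fn` is a hypothetical environment callback.

```python
import numpy as np

class KalmanArm:
    """Scalar Kalman filter tracking one arm's value mean and its variance."""

    def __init__(self, mean=0.0, var=1.0, process_noise=0.01, obs_noise=1.0):
        self.mean, self.var = mean, var
        self.process_noise, self.obs_noise = process_noise, obs_noise

    def predict(self):
        # Non-stationarity: every arm's uncertainty grows at each step.
        self.var += self.process_noise

    def update(self, reward):
        # Standard Kalman correction after observing a reward for this arm.
        gain = self.var / (self.var + self.obs_noise)
        self.mean += gain * (reward - self.mean)
        self.var *= (1.0 - gain)


def mlb_kf_step(arms, meta_beta, reward_fn, uncertainty_scale=1.0):
    """One bandit step: softmax over Kalman means, with an effective beta
    reduced when average posterior variance is high (more exploration)."""
    for arm in arms:
        arm.predict()
    means = np.array([a.mean for a in arms])
    mean_unc = np.mean([a.var for a in arms])
    beta_eff = meta_beta.beta / (1.0 + uncertainty_scale * mean_unc)
    prefs = beta_eff * means
    prefs -= prefs.max()
    probs = np.exp(prefs) / np.exp(prefs).sum()
    choice = np.random.choice(len(arms), p=probs)
    reward = reward_fn(choice)
    arms[choice].update(reward)
    meta_beta.update(reward)       # MLB meta-learning of beta_t
    return choice, reward
```

A run would look like `arms = [KalmanArm() for _ in range(4)]`, `meta = MLBTemperature()`, then calling `mlb_kf_step(arms, meta, reward_fn)` in a loop against a (possibly non-stationary) reward function.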