2019
DOI: 10.1609/aaai.v33i01.33013796
Multi-Task Deep Reinforcement Learning with PopArt

Abstract: The reinforcement learning community has made great strides in designing algorithms capable of exceeding human performance on specific tasks. These algorithms are mostly trained one task at a time, with each new task requiring a brand-new agent instance to be trained. This means the learning algorithm is general, but each solution is not; each agent can only solve the one task it was trained on. In this work, we study the problem of learning to master not one but multiple sequential-decision tasks at once. A general is…
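The PopArt normalisation in the title refers to the technique of van Hasselt et al. (2016), which rescales value targets while rescaling the output layer so that predictions are preserved; the paper applies it per task in a multi-task setting. Below is a minimal NumPy sketch of that update rule for intuition only: the step size beta, the layer shapes, and the class/method names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class PopArtHead:
    """Per-task linear value head with PopArt target normalisation (sketch).

    Tracks running mean / second-moment statistics of the value targets for
    each task and, whenever the statistics change, rescales the output layer
    so the unnormalised predictions are preserved.
    """

    def __init__(self, n_features, n_tasks, beta=3e-4):
        self.w = np.zeros((n_tasks, n_features))   # output weights, one row per task
        self.b = np.zeros(n_tasks)                 # output biases
        self.mu = np.zeros(n_tasks)                # running mean of targets
        self.nu = np.ones(n_tasks)                 # running second moment of targets
        self.beta = beta                           # illustrative step size

    def sigma(self):
        return np.sqrt(np.maximum(self.nu - self.mu ** 2, 1e-8))

    def update_stats(self, task, target):
        """Update the statistics for `task`, then rescale w, b to preserve outputs."""
        mu_old, sigma_old = self.mu[task], self.sigma()[task]
        self.mu[task] += self.beta * (target - self.mu[task])
        self.nu[task] += self.beta * (target ** 2 - self.nu[task])
        mu_new, sigma_new = self.mu[task], self.sigma()[task]
        # Preserve Outputs Precisely:
        # sigma_new * (w'x + b') + mu_new == sigma_old * (w x + b) + mu_old for all x
        self.w[task] *= sigma_old / sigma_new
        self.b[task] = (sigma_old * self.b[task] + mu_old - mu_new) / sigma_new

    def normalized_value(self, task, features):
        return self.w[task] @ features + self.b[task]

    def value(self, task, features):
        """Unnormalised value estimate recovered from the normalised head."""
        return self.sigma()[task] * self.normalized_value(task, features) + self.mu[task]
```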

Cited by 329 publications (394 citation statements) | References 3 publications
“…We compared the following control policies: RGBD uses only the segmentation from S_θ(x) and thus no thermal measurements. It provides a loose lower bound on the performance, since the additional thermal modality provides an important cue for the segmentation task and improves performance in general, no matter which views are selected. DQN provides reactive control similar to Mnih et al. (2015), with the double DQN extension from Hasselt, Guez, and Silver (2016) and the prioritized experience replay from Schaul, Quan, Antonoglou, and Silver (2016). Greedy D_KL corresponds to the Δℋ_ω1 network predicting the gain obtained through self-supervision. The predicted pixel-wise gain is accumulated by viewpoint kernels and the maximum within the motion constraints is selected for the next action. GQ0-D_KL corresponds to the Q_ω network obtained from the self-supervised policy initialization. GQ1-ΔH corresponds to the Q_ω network fine-tuned on the guiding trajectories (p = 1), with ω_1 previously trained to predict the true gain ΔH. Optimal uses the additional information of the true ΔH to plan the optimal trajectory by solving instances of a MILP.…”
Section: Methods (mentioning)
confidence: 99%
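The "Greedy D_KL" baseline in the excerpt above accumulates a predicted pixel-wise gain map over viewpoint kernels and then picks the best-scoring viewpoint that the motion constraints allow. The NumPy sketch below illustrates only that selection step; the kernel representation, the feasibility mask, and the function name are assumptions made for illustration, not the cited paper's code.

```python
import numpy as np

def select_next_view(pixel_gain, view_kernels, feasible):
    """Greedy viewpoint selection from a predicted pixel-wise gain map (sketch).

    pixel_gain   : (H, W) array of predicted per-pixel information gain
    view_kernels : (V, H, W) array; view_kernels[v] weights the pixels that
                   viewpoint v would observe (illustrative representation)
    feasible     : (V,) boolean mask of viewpoints reachable under the
                   platform's motion constraints
    Returns the index of the feasible viewpoint with the largest accumulated gain.
    """
    accumulated = (view_kernels * pixel_gain[None, :, :]).sum(axis=(1, 2))
    accumulated[~feasible] = -np.inf      # rule out unreachable viewpoints
    return int(np.argmax(accumulated))

# Toy usage with random inputs
rng = np.random.default_rng(0)
gain = rng.random((64, 64))
kernels = rng.random((8, 64, 64))
feasible = np.array([True, True, False, True, True, False, True, True])
print(select_next_view(gain, kernels, feasible))
```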
“…We compared the following control policies: • DQN provides reactive control similar to Mnih et al (2015) with the double DQN extension from Hasselt, Guez, and Silver (2016) and the prioritized experience replay from Schaul, Quan, Antonoglou, and Silver (2016).…”
Section: Experiments Using a SAR Platform (mentioning)
confidence: 99%
“…The concept of Q(s_t, a_t) is to evaluate how good the action a_t performed by the UAV in state s_t is. As illustrated in [14], DQN approximates the Q-value by using two deep neural networks (DNNs) with the same four fully connected layers but different parameters φ_1 and φ_2. One is the predicted network, whose input is the current state-action pair (s_t, a_t) and whose output is the predicted value, i.e., Q^DQN_predicted(s_t, a_t; φ_1).…”
Section: A Deep Q-Network (DQN) (mentioning)
confidence: 99%
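The excerpt above describes DQN as keeping two networks with identical fully connected architectures but separate parameters, φ_1 (the predicted/online network) and φ_2 (the target network). The PyTorch sketch below shows that arrangement in the common state-in, one-Q-value-per-action parameterisation; the state dimension, action count, layer widths, and function names are illustrative assumptions rather than values from the cited paper.

```python
import copy
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Four fully connected layers mapping a state to one Q-value per action."""
    def __init__(self, state_dim=8, n_actions=4, hidden=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.layers(state)

# phi_1: the predicted (online) network, trained at every step.
predicted_net = QNetwork()
# phi_2: the target network, a periodically synchronised copy of phi_1.
target_net = copy.deepcopy(predicted_net)

def dqn_target(reward, next_state, done, gamma=0.99):
    """Standard DQN target: r + gamma * max_a' Q(s_{t+1}, a'; phi_2)."""
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=1).values
    return reward + gamma * (1.0 - done) * next_q

def sync_target():
    """Copy phi_1 into phi_2 every fixed number of training steps."""
    target_net.load_state_dict(predicted_net.state_dict())
```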
“…The DQN structure chooses max_a′ Q(s_{t+1}, a′; φ_2) directly in the target network, whose parameters are not updated in a timely manner, which may lead to overestimation of the Q-value [14]. To address the overestimation problem, DDQN applies two independent estimators to approximate the Q-value.…”
Section: B DDQN With Proposed QoS-Based ε-Greedy Policy (mentioning)
confidence: 99%
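The overestimation described above comes from using the same network both to select and to evaluate the maximising action. Double DQN decouples the two roles: the online network (φ_1) selects argmax_a′ Q(s_{t+1}, a′; φ_1), and the target network (φ_2) evaluates that action. A sketch of the target computation, reusing the hypothetical predicted_net and target_net from the previous snippet:

```python
import torch

def ddqn_target(reward, next_state, done, gamma=0.99):
    """Double DQN target: r + gamma * Q(s_{t+1}, argmax_a' Q(s_{t+1}, a'; phi_1); phi_2)."""
    with torch.no_grad():
        # Action selection with the online network (phi_1)...
        best_actions = predicted_net(next_state).argmax(dim=1, keepdim=True)
        # ...but action evaluation with the target network (phi_2).
        next_q = target_net(next_state).gather(1, best_actions).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q
```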