2019
DOI: 10.1609/aaai.v33i01.33013796
Multi-Task Deep Reinforcement Learning with PopArt

Abstract: The reinforcement learning community has made great strides in designing algorithms capable of exceeding human performance on specific tasks. These algorithms are mostly trained one task at a time, with each new task requiring a brand-new agent instance to be trained. This means the learning algorithm is general, but each solution is not; each agent can only solve the one task it was trained on. In this work, we study the problem of learning to master not one but multiple sequential-decision tasks at once. A general is…
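The PopArt normalisation in the title refers to the technique of van Hasselt et al. (2016), which rescales value targets while rescaling the output layer so that predictions are preserved; the paper applies it per task in a multi-task setting. Below is a minimal NumPy sketch of that update rule for intuition only: the step size beta, the layer shapes, and the class/method names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class PopArtHead:
    """Per-task linear value head with PopArt target normalisation (sketch).

    Tracks running mean / second-moment statistics of the value targets for
    each task and, whenever the statistics change, rescales the output layer
    so the unnormalised predictions are preserved.
    """

    def __init__(self, n_features, n_tasks, beta=3e-4):
        self.w = np.zeros((n_tasks, n_features))   # output weights, one row per task
        self.b = np.zeros(n_tasks)                 # output biases
        self.mu = np.zeros(n_tasks)                # running mean of targets
        self.nu = np.ones(n_tasks)                 # running second moment of targets
        self.beta = beta                           # illustrative step size

    def sigma(self):
        return np.sqrt(np.maximum(self.nu - self.mu ** 2, 1e-8))

    def update_stats(self, task, target):
        """Update the statistics for `task`, then rescale w, b to preserve outputs."""
        mu_old, sigma_old = self.mu[task], self.sigma()[task]
        self.mu[task] += self.beta * (target - self.mu[task])
        self.nu[task] += self.beta * (target ** 2 - self.nu[task])
        mu_new, sigma_new = self.mu[task], self.sigma()[task]
        # Preserve Outputs Precisely:
        # sigma_new * (w'x + b') + mu_new == sigma_old * (w x + b) + mu_old for all x
        self.w[task] *= sigma_old / sigma_new
        self.b[task] = (sigma_old * self.b[task] + mu_old - mu_new) / sigma_new

    def normalized_value(self, task, features):
        return self.w[task] @ features + self.b[task]

    def value(self, task, features):
        """Unnormalised value estimate recovered from the normalised head."""
        return self.sigma()[task] * self.normalized_value(task, features) + self.mu[task]
```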

Cited by 329 publications (394 citation statements) | References 3 publications
“…We compared the following control policies: RGBD uses only the segmentation from S_θ(x) and thus no thermal measurements. It provides a loose lower bound on the performance, since the additional thermal modality provides an important cue for the segmentation task and improves performance in general, no matter which views are selected. DQN provides reactive control similar to Mnih et al. (2015), with the double DQN extension from Hasselt, Guez, and Silver (2016) and the prioritized experience replay from Schaul, Quan, Antonoglou, and Silver (2016). Greedy D_KL corresponds to the Δℋ_ω1 network predicting the gain obtained through self-supervision. The predicted pixel-wise gain is accumulated by viewpoint kernels and the maximum within the motion constraints is selected for the next action. GQ0-D_KL corresponds to the Q_ω network obtained from the self-supervised policy initialization. GQ1-ΔH corresponds to the Q_ω network fine-tuned on the guiding trajectories (p = 1), with ω_1 previously trained to predict the true gain ΔH. Optimal uses the additional information of the true ΔH to plan the optimal trajectory by solving instances of a MILP.…”
Section: Methods (mentioning)
confidence: 99%
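The "Greedy D_KL" baseline in the excerpt above accumulates a predicted pixel-wise gain map over viewpoint kernels and then picks the best-scoring viewpoint that the motion constraints allow. The NumPy sketch below illustrates only that selection step; the kernel representation, the feasibility mask, and the function name are assumptions made for illustration, not the cited paper's code.

```python
import numpy as np

def select_next_view(pixel_gain, view_kernels, feasible):
    """Greedy viewpoint selection from a predicted pixel-wise gain map (sketch).

    pixel_gain   : (H, W) array of predicted per-pixel information gain
    view_kernels : (V, H, W) array; view_kernels[v] weights the pixels that
                   viewpoint v would observe (illustrative representation)
    feasible     : (V,) boolean mask of viewpoints reachable under the
                   platform's motion constraints
    Returns the index of the feasible viewpoint with the largest accumulated gain.
    """
    accumulated = (view_kernels * pixel_gain[None, :, :]).sum(axis=(1, 2))
    accumulated[~feasible] = -np.inf      # rule out unreachable viewpoints
    return int(np.argmax(accumulated))

# Toy usage with random inputs
rng = np.random.default_rng(0)
gain = rng.random((64, 64))
kernels = rng.random((8, 64, 64))
feasible = np.array([True, True, False, True, True, False, True, True])
print(select_next_view(gain, kernels, feasible))
```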
“…We compared the following control policies: • DQN provides reactive control similar to Mnih et al (2015) with the double DQN extension from Hasselt, Guez, and Silver (2016) and the prioritized experience replay from Schaul, Quan, Antonoglou, and Silver (2016).…”
Section: Experiments Using a SAR Platform (mentioning)
confidence: 99%
“…The concept of Q(s_t, a_t) is to evaluate how good the action a_t performed by the UAV in state s_t is. As illustrated in [14], DQN approximates the Q-value by using two deep neural networks (DNNs) with the same four fully connected layers but different parameters φ_1 and φ_2. One is the predicted network, whose input is the current state-action pair (s_t, a_t) and whose output is the predicted value, i.e., Q^DQN_predicted(s_t, a_t; φ_1).…”
Section: A Deep Q-Network (DQN) (mentioning)
confidence: 99%
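The excerpt above describes DQN as keeping two networks with identical fully connected architectures but separate parameters, φ_1 (the predicted/online network) and φ_2 (the target network). The PyTorch sketch below shows that arrangement in the common state-in, one-Q-value-per-action parameterisation; the state dimension, action count, layer widths, and function names are illustrative assumptions rather than values from the cited paper.

```python
import copy
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Four fully connected layers mapping a state to one Q-value per action."""
    def __init__(self, state_dim=8, n_actions=4, hidden=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.layers(state)

# phi_1: the predicted (online) network, trained at every step.
predicted_net = QNetwork()
# phi_2: the target network, a periodically synchronised copy of phi_1.
target_net = copy.deepcopy(predicted_net)

def dqn_target(reward, next_state, done, gamma=0.99):
    """Standard DQN target: r + gamma * max_a' Q(s_{t+1}, a'; phi_2)."""
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=1).values
    return reward + gamma * (1.0 - done) * next_q

def sync_target():
    """Copy phi_1 into phi_2 every fixed number of training steps."""
    target_net.load_state_dict(predicted_net.state_dict())
```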
“…The DQN structure chooses max_a′ Q(s_{t+1}, a′; φ_2) directly in the target network, whose parameters are not updated in a timely manner, which may lead to overestimation of the Q-value [14]. To address the overestimation problem, DDQN applies two independent estimators to approximate the Q-value.…”
Section: B DDQN With Proposed QoS-Based ε-Greedy Policy (mentioning)
confidence: 99%
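The overestimation described above comes from using the same network both to select and to evaluate the maximising action. Double DQN decouples the two roles: the online network (φ_1) selects argmax_a′ Q(s_{t+1}, a′; φ_1), and the target network (φ_2) evaluates that action. A sketch of the target computation, reusing the hypothetical predicted_net and target_net from the previous snippet:

```python
import torch

def ddqn_target(reward, next_state, done, gamma=0.99):
    """Double DQN target: r + gamma * Q(s_{t+1}, argmax_a' Q(s_{t+1}, a'; phi_1); phi_2)."""
    with torch.no_grad():
        # Action selection with the online network (phi_1)...
        best_actions = predicted_net(next_state).argmax(dim=1, keepdim=True)
        # ...but action evaluation with the target network (phi_2).
        next_q = target_net(next_state).gather(1, best_actions).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q
```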