2020
DOI: 10.1609/aaai.v34i04.5983

Scaling All-Goals Updates in Reinforcement Learning Using Convolutional Neural Networks

Abstract: Being able to reach any desired location in the environment can be a valuable asset for an agent. Learning a policy to navigate between all pairs of states individually is often not feasible. An all-goals updating algorithm uses each transition to learn Q-values towards all goals simultaneously and off-policy. However, the expensive numerous updates in parallel limited the approach to small tabular cases so far. To tackle this problem, we propose to use convolutional network architectures to generate Q-values an…
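The abstract describes learning Q-values towards all goals at once with a convolutional network, using every transition for an off-policy update of every goal. Below is a minimal sketch of that idea, not the authors' exact architecture or loss: the network maps an image observation to one Q-value map per action, and a single transition produces a TD target for every goal cell. The names `AllGoalsQNet` and `all_goals_td_loss`, as well as the `goal_reward`/`goal_done` tensors, are hypothetical helpers assumed for illustration.

```python
# Sketch only: Q-values for every grid goal at once via a convolutional network.
# Observation assumed to be a (C, H, W) image; output interpreted as
# Q(s, a, goal=(i, j)) for every cell (i, j).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllGoalsQNet(nn.Module):
    def __init__(self, in_channels: int, num_actions: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            # One output channel per action: a full map of Q(s, a, g) over goal cells g.
            nn.Conv2d(32, num_actions, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (B, C, H, W) -> q: (B, num_actions, H, W)
        return self.body(obs)

def all_goals_td_loss(q_net, obs, action, next_obs, goal_reward, goal_done, gamma=0.99):
    """One off-policy update for *all* goals from a single transition.

    action: (B,) long tensor of taken actions.
    goal_reward, goal_done: (B, H, W) tensors giving, for each goal cell, the
    goal-conditioned reward and whether that goal is reached at next_obs
    (hypothetical quantities derived from the environment's state).
    """
    with torch.no_grad():
        next_q = q_net(next_obs)                 # (B, A, H, W)
        best_next = next_q.max(dim=1).values     # (B, H, W)
        target = goal_reward + gamma * (1.0 - goal_done) * best_next
    q = q_net(obs)                               # (B, A, H, W)
    idx = action.view(-1, 1, 1, 1).expand(-1, 1, *q.shape[2:])
    q_taken = q.gather(1, idx).squeeze(1)        # (B, H, W): Q of the taken action, all goals
    return F.mse_loss(q_taken, target)
```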


Cited by 10 publications (10 citation statements)
References 7 publications
“…However, these algorithms require a complete evaluation before each update of policies, so they fail to achieve a high sampling efficiency. Cao et al. [16] and Pardo et al. [23] directly optimized the undiscounted return by augmenting the state space with the remaining time. Then, acyclic MDPs were constructed to eliminate transient traps.…”
Section: Related Work (mentioning)
confidence: 99%
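The snippet above refers to augmenting the state with the remaining time so that a time-limited, undiscounted problem stays Markovian. A minimal sketch of that augmentation follows; the wrapper name, the use of gymnasium, and the normalization to [0, 1] are illustrative choices, not the cited papers' exact implementations.

```python
# Sketch: append the normalized remaining time to each observation.
import numpy as np
import gymnasium as gym

class RemainingTimeWrapper(gym.Wrapper):
    def __init__(self, env: gym.Env, time_limit: int):
        super().__init__(env)
        self.time_limit = time_limit
        self._t = 0
        # Note: observation_space is left untouched here for brevity.

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._t = 0
        return self._augment(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._t += 1
        truncated = truncated or self._t >= self.time_limit
        return self._augment(obs), reward, terminated, truncated, info

    def _augment(self, obs):
        remaining = (self.time_limit - self._t) / self.time_limit
        return np.append(np.asarray(obs, dtype=np.float32), remaining)
```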
“…for preserving the optimal policy [13], [16], [17], [22]–[24]. On the other hand, similar to the cases with undiscounted returns [13], [16], [17], [23], [25], γ close to 1 (e.g., γ > 0.99) also causes the training instability problem. Thus, the analysis of the instability of undiscounted RL helps to alleviate the instability of optimizing a discounted return with a large γ.…”
mentioning
confidence: 99%
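One back-of-the-envelope way to see why γ close to 1 behaves almost like the undiscounted case (this is a standard geometric-series observation, not the cited paper's analysis): with a constant reward r, the discounted return approaches r/(1 − γ), so the scale of value targets and the effective horizon 1/(1 − γ) blow up as γ → 1.

```python
# Effective horizon and return scale for constant reward r = 1.
for gamma in (0.9, 0.99, 0.999):
    horizon = 1.0 / (1.0 - gamma)
    ret = sum(gamma ** t for t in range(10_000))  # discounted return over a long episode
    print(f"gamma={gamma}: effective horizon ~{horizon:.0f}, return ~{ret:.1f}")
```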
“…Lu et al. [23] explored the use of causal models of the state dynamics for counterfactual data augmentation. Time limits in RL have been employed to manage task complexity and facilitate learning, with Pardo et al. [28] analyzing their utility in diversifying training experience and boosting performance when combined with agent time-awareness.…”
Section: Rewards (mentioning)
confidence: 99%
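A practice commonly associated with time limits in RL, sketched below with illustrative names: when an episode ends only because the time limit was hit, keep bootstrapping from the next state instead of treating the cutoff as a true terminal state, so the agent is not taught that the world ends at the timeout.

```python
# Sketch: distinguish true termination from a time-limit truncation in the TD target.
def td_target(reward: float, next_q_max: float, terminated: bool,
              truncated: bool, gamma: float = 0.99) -> float:
    # Only a genuine environment termination removes the bootstrap term;
    # a truncation (time limit) deliberately keeps it.
    bootstrap = 0.0 if terminated else next_q_max
    return reward + gamma * bootstrap
```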
“…The terms targets and goals have been used in diverse manners in the existing literature. There are subgoal generation methods that generate intermediate goals such as imagined goal [31,14] or random goals [34] to help agents solve the tasks. Multi-goal RL [47,35,13,10] aims to deal with multiple tasks, learning to reach different goal states for each task.…”
Section: Related Work (mentioning)
confidence: 99%
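The quote above contrasts subgoal generation with multi-goal RL, where a single value function is conditioned on the goal to reach. A minimal sketch of that setup with hypothetical names: the Q-network takes both state and goal, and transitions can be relabeled with goals actually achieved later in the episode, in the spirit of hindsight experience replay, so the same data teaches about many goals.

```python
# Sketch: goal-conditioned Q-network plus a simple future-goal relabeling pass.
import random
import torch
import torch.nn as nn

class GoalConditionedQ(nn.Module):
    def __init__(self, state_dim: int, goal_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, state: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        # Q(s, ., g): concatenate state and goal, output one value per action.
        return self.net(torch.cat([state, goal], dim=-1))

def relabel_with_achieved_goals(episode):
    """Replace each transition's goal with a state actually reached later.

    episode: list of (state, action, reward, next_state, goal) tuples of
    array-like states; rewards are recomputed for the new goal.
    """
    relabeled = []
    for i, (s, a, r, s_next, g) in enumerate(episode):
        future = random.choice(episode[i:])
        achieved = future[3]  # a future next-state becomes the new goal
        new_r = 1.0 if bool((achieved == s_next).all()) else 0.0
        relabeled.append((s, a, new_r, s_next, achieved))
    return relabeled
```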