2020
DOI: 10.1609/aaai.v34i04.5983

Scaling All-Goals Updates in Reinforcement Learning Using Convolutional Neural Networks

Abstract: Being able to reach any desired location in the environment can be a valuable asset for an agent. Learning a policy to navigate between all pairs of states individually is often not feasible. An all-goals updating algorithm uses each transition to learn Q-values towards all goals simultaneously and off-policy. However, the expensive numerous updates in parallel limited the approach to small tabular cases so far. To tackle this problem, we propose to use convolutional network architectures to generate Q-values an…
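The abstract describes learning Q-values towards all goals at once with a convolutional network, using every transition for an off-policy update of every goal. Below is a minimal sketch of that idea, not the authors' exact architecture or loss: the network maps an image observation to one Q-value map per action, and a single transition produces a TD target for every goal cell. The names `AllGoalsQNet` and `all_goals_td_loss`, as well as the `goal_reward`/`goal_done` tensors, are hypothetical helpers assumed for illustration.

```python
# Sketch only: Q-values for every grid goal at once via a convolutional network.
# Observation assumed to be a (C, H, W) image; output interpreted as
# Q(s, a, goal=(i, j)) for every cell (i, j).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllGoalsQNet(nn.Module):
    def __init__(self, in_channels: int, num_actions: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            # One output channel per action: a full map of Q(s, a, g) over goal cells g.
            nn.Conv2d(32, num_actions, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (B, C, H, W) -> q: (B, num_actions, H, W)
        return self.body(obs)

def all_goals_td_loss(q_net, obs, action, next_obs, goal_reward, goal_done, gamma=0.99):
    """One off-policy update for *all* goals from a single transition.

    action: (B,) long tensor of taken actions.
    goal_reward, goal_done: (B, H, W) tensors giving, for each goal cell, the
    goal-conditioned reward and whether that goal is reached at next_obs
    (hypothetical quantities derived from the environment's state).
    """
    with torch.no_grad():
        next_q = q_net(next_obs)                 # (B, A, H, W)
        best_next = next_q.max(dim=1).values     # (B, H, W)
        target = goal_reward + gamma * (1.0 - goal_done) * best_next
    q = q_net(obs)                               # (B, A, H, W)
    idx = action.view(-1, 1, 1, 1).expand(-1, 1, *q.shape[2:])
    q_taken = q.gather(1, idx).squeeze(1)        # (B, H, W): Q of the taken action, all goals
    return F.mse_loss(q_taken, target)
```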


Cited by 10 publications (10 citation statements)
References 7 publications
“…However, these algorithms require a complete evaluation before each update of policies, so they fail to achieve a high sampling efficiency. Cao et al. [16] and Pardo et al. [23] directly optimized the undiscounted return by augmenting the state space with the remaining time. Then, acyclic MDPs were constructed to eliminate transient traps.…”
Section: Related Work (mentioning)
confidence: 99%
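The snippet above refers to augmenting the state with the remaining time so that a time-limited, undiscounted problem stays Markovian. A minimal sketch of that augmentation follows; the wrapper name, the use of gymnasium, and the normalization to [0, 1] are illustrative choices, not the cited papers' exact implementations.

```python
# Sketch: append the normalized remaining time to each observation.
import numpy as np
import gymnasium as gym

class RemainingTimeWrapper(gym.Wrapper):
    def __init__(self, env: gym.Env, time_limit: int):
        super().__init__(env)
        self.time_limit = time_limit
        self._t = 0
        # Note: observation_space is left untouched here for brevity.

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._t = 0
        return self._augment(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._t += 1
        truncated = truncated or self._t >= self.time_limit
        return self._augment(obs), reward, terminated, truncated, info

    def _augment(self, obs):
        remaining = (self.time_limit - self._t) / self.time_limit
        return np.append(np.asarray(obs, dtype=np.float32), remaining)
```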
“…for preserving the optimal policy [13], [16], [17], [22]–[24]. On the other hand, similar to the cases with undiscounted returns [13], [16], [17], [23], [25], γ close to 1 (e.g., γ > 0.99) also causes the training instability problem. Thus, the analysis of the instability of undiscounted RL helps to alleviate the instability of optimizing a discounted return with a large γ.…”
mentioning
confidence: 99%
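One back-of-the-envelope way to see why γ close to 1 behaves almost like the undiscounted case (this is a standard geometric-series observation, not the cited paper's analysis): with a constant reward r, the discounted return approaches r/(1 − γ), so the scale of value targets and the effective horizon 1/(1 − γ) blow up as γ → 1.

```python
# Effective horizon and return scale for constant reward r = 1.
for gamma in (0.9, 0.99, 0.999):
    horizon = 1.0 / (1.0 - gamma)
    ret = sum(gamma ** t for t in range(10_000))  # discounted return over a long episode
    print(f"gamma={gamma}: effective horizon ~{horizon:.0f}, return ~{ret:.1f}")
```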
“…Lu et al. [23] explored the use of causal models of the state dynamics for counterfactual data augmentation. Time limits in RL have been employed to manage task complexity and facilitate learning, with Pardo et al. [28] analyzing their utility in diversifying training experience and boosting performance when combined with agent time-awareness.…”
Section: Rewards (mentioning)
confidence: 99%
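A practice commonly associated with time limits in RL, sketched below with illustrative names: when an episode ends only because the time limit was hit, keep bootstrapping from the next state instead of treating the cutoff as a true terminal state, so the agent is not taught that the world ends at the timeout.

```python
# Sketch: distinguish true termination from a time-limit truncation in the TD target.
def td_target(reward: float, next_q_max: float, terminated: bool,
              truncated: bool, gamma: float = 0.99) -> float:
    # Only a genuine environment termination removes the bootstrap term;
    # a truncation (time limit) deliberately keeps it.
    bootstrap = 0.0 if terminated else next_q_max
    return reward + gamma * bootstrap
```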
“…The terms targets and goals have been used in diverse manners in the existing literature. There are subgoal generation methods that generate intermediate goals such as imagined goal [31,14] or random goals [34] to help agents solve the tasks. Multi-goal RL [47,35,13,10] aims to deal with multiple tasks, learning to reach different goal states for each task.…”
Section: Related Work (mentioning)
confidence: 99%
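The quote above contrasts subgoal generation with multi-goal RL, where a single value function is conditioned on the goal to reach. A minimal sketch of that setup with hypothetical names: the Q-network takes both state and goal, and transitions can be relabeled with goals actually achieved later in the episode, in the spirit of hindsight experience replay, so the same data teaches about many goals.

```python
# Sketch: goal-conditioned Q-network plus a simple future-goal relabeling pass.
import random
import torch
import torch.nn as nn

class GoalConditionedQ(nn.Module):
    def __init__(self, state_dim: int, goal_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, state: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        # Q(s, ., g): concatenate state and goal, output one value per action.
        return self.net(torch.cat([state, goal], dim=-1))

def relabel_with_achieved_goals(episode):
    """Replace each transition's goal with a state actually reached later.

    episode: list of (state, action, reward, next_state, goal) tuples of
    array-like states; rewards are recomputed for the new goal.
    """
    relabeled = []
    for i, (s, a, r, s_next, g) in enumerate(episode):
        future = random.choice(episode[i:])
        achieved = future[3]  # a future next-state becomes the new goal
        new_r = 1.0 if bool((achieved == s_next).all()) else 0.0
        relabeled.append((s, a, new_r, s_next, achieved))
    return relabeled
```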