2023
DOI: 10.1109/lra.2022.3229236

Adaptively Calibrated Critic Estimates for Deep Reinforcement Learning

Abstract: Early stopping based on the validation set performance is a popular approach to find the right balance between under- and overfitting in the context of supervised learning. However, in reinforcement learning, even for supervised sub-problems such as world model learning, early stopping is not applicable as the dataset is continually evolving. As a solution, we propose a new general method that dynamically adjusts the update-to-data (UTD) ratio during training based on under- and overfitting detection on a small s…
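The mechanism sketched in the abstract lends itself to a short illustration. The following is a minimal sketch, not the paper's published algorithm: it holds out a small slice of freshly collected experience, compares the held-out loss with the training loss, and raises or lowers the number of gradient updates taken per environment step. All function names, thresholds, and the concrete adjustment rule are assumptions made for illustration.

# Illustrative sketch (not the paper's actual code) of the dynamic UTD idea
# from the abstract: hold out a small slice of freshly collected experience,
# compare its loss with the training loss, and adjust how many gradient
# updates are taken per environment step. Names, thresholds, and the
# adjustment rule below are assumptions for illustration only.

def adjust_utd(utd: int, train_loss: float, val_loss: float,
               utd_min: int = 1, utd_max: int = 20,
               margin: float = 0.05) -> int:
    """Return an updated update-to-data ratio from an over/underfitting signal."""
    gap = val_loss - train_loss
    if gap > margin:
        # Held-out loss clearly worse than training loss: overfitting,
        # so take fewer gradient updates per collected transition.
        return max(utd_min, utd - 1)
    # No sign of overfitting: the model can absorb more updates per step.
    return min(utd_max, utd + 1)

# Hypothetical training loop showing where the adjustment would sit:
#   utd = 5
#   for step in range(num_env_steps):
#       transition = collect_step(policy)
#       (holdout if step % k == 0 else train).add(transition)
#       for _ in range(utd):
#           train_loss = model.update(train.sample(batch))
#       val_loss = model.evaluate(holdout.sample(batch))
#       utd = adjust_utd(utd, train_loss, val_loss)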

Cited by 6 publications (5 citation statements) · References 24 publications

Citation statements:
“…This approach proves advantageous in handling large-scale problems where explicit enumeration becomes computationally infeasible. This enhances the suitability of DRL methods for real-world applications [28].…”
Section: A Markov Decision Process Formulation
confidence: 85%
“…REDQ (Chen et al 2020) and its variant (Wu et al 2022b) employ ensembled value networks to mitigate the value bias caused by increased UTD. (Dorka, Welschehold, and Burgard 2023) addresses model overfitting by employing dynamic UTD. (Li et al 2022) investigates the factors contributing to inferior performance in high UTD learning.…”
Section: UTD in RL
confidence: 99%
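The REDQ-style remedy mentioned in the statement above, an ensemble of value networks with an in-target minimum over a random subset, can be sketched briefly. This is a hedged illustration, not code from any of the cited papers; it assumes PyTorch, and the critic callables, shapes, and names are hypothetical.

# Hedged sketch of the REDQ-style target mentioned above: keep an ensemble
# of critics and bootstrap from the minimum over a small random subset,
# which dampens the value overestimation that grows with high UTD.
import random
import torch

def redq_target(critics, reward, next_obs, next_action, done,
                gamma: float = 0.99, subset_size: int = 2):
    """TD target using the min over a random subset of an ensemble of critics."""
    idx = random.sample(range(len(critics)), subset_size)
    q_values = torch.stack([critics[i](next_obs, next_action) for i in idx])
    min_q = q_values.min(dim=0).values        # pessimistic ensemble estimate
    return reward + gamma * (1.0 - done) * min_q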
“…Adaptively Calibrated Critic (ACC) [29] is a concurrent method presenting essentially the same idea as this paper while using a slightly different optimization procedure. The reason for the slight improvement of ACC over AdaTQC on Hopper is likely the limited range of possible values for η ∈ [0, 5].…”
Section: Process of η Adaptation
confidence: 99%
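The η adaptation discussed in this statement can be illustrated with a toy update rule. The sketch below is an assumption-laden stand-in, not ACC's or AdaTQC's actual procedure: it nudges η against the measured critic bias, assuming a larger η yields a more pessimistic value estimate, and clips to the [0, 5] range quoted above.

# Toy stand-in for the η adaptation discussed above (not ACC's or AdaTQC's
# actual procedure): nudge η against the measured critic bias, assuming a
# larger η yields a more pessimistic value estimate, and clip to the [0, 5]
# range quoted in the statement. The step size is an arbitrary choice.

def adapt_eta(eta: float, q_estimate: float, observed_return: float,
              step_size: float = 0.1,
              eta_min: float = 0.0, eta_max: float = 5.0) -> float:
    """Move η up on overestimation, down on underestimation, within [0, 5]."""
    bias = q_estimate - observed_return  # >0: overestimation, <0: underestimation
    eta += step_size * (1.0 if bias > 0 else -1.0)
    return min(eta_max, max(eta_min, eta))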
“…Our paper organically continues this line of research on overestimation bias. Another work closely related to ours is Adaptively Calibrated Critic (ACC) [29]. It is a concurrent workshop paper that investigates essentially the same idea, but applies it only to TQC.…”
confidence: 99%