2022
DOI: 10.48550/arxiv.2202.12504
Preprint
Consolidated Adaptive T-soft Update for Deep Reinforcement Learning

Abstract: Demand for deep reinforcement learning (DRL) is gradually increasing to enable robots to perform complex tasks, while DRL is known to be unstable. As a technique to stabilize its learning, a target network that slowly and asymptotically matches a main network is widely employed to generate stable pseudo-supervised signals. Recently, T-soft update has been proposed as a noise-robust update rule for the target network and has contributed to improving the DRL performance. However, the noise robustness of T-soft up…

Cited by 1 publication (12 citation statements)
References 17 publications (30 reference statements)
“…Initially, the "hard" update strategy of copying the main Q-network to the target-network after a certain period of time was used [5]. Since then, the "soft" update strategy has been used, interpolating the parameters of the target-network with a fixed ratio between the current parameters of the target-network and the parameters of the main Q-network [10][11][12][13][14].…”
Section: Introduction
confidence: 99%
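The hard and soft update rules described in the citation above can be sketched as follows. This is a minimal illustration with toy parameter vectors; the function names and shapes are not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "network" parameters represented as flat vectors.
main = rng.normal(size=4)   # parameters of the main Q-network
target = np.zeros(4)        # parameters of the target network

def hard_update(target, main):
    """Hard update: copy the main parameters into the target wholesale,
    typically once every fixed number of steps."""
    target[:] = main

def soft_update(target, main, tau=0.01):
    """Soft update: interpolate the target toward the main parameters
    with a fixed ratio tau at every step."""
    target[:] = (1.0 - tau) * target + tau * main

soft_update(target, main, tau=0.5)  # target moves halfway toward main
hard_update(target, main)           # target now equals main exactly
```

With a small `tau`, the target network changes only slightly per step, which is what makes it a slowly, asymptotically matching copy of the main network.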
“…However, these methods exhibit several limitations. In addition to their slow learning speed, they are sensitive to noise and outliers in the parameter updates of the main Q-network [11][12][13]. A simple solution to this problem is to reduce the copy ratio; however, this makes learning considerably slower.…”
Section: Introduction
confidence: 99%
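The trade-off noted above, that shrinking the interpolation ratio slows learning, can be demonstrated with a toy scalar parameter. The helper below is illustrative only (not code from the paper): it counts how many soft-update steps the target needs to come within a tolerance of a fixed main parameter:

```python
def steps_to_converge(tau, tol=0.01):
    """Count soft-update steps until the target is within tol of main."""
    target, main = 0.0, 1.0
    steps = 0
    while abs(main - target) > tol:
        target = (1.0 - tau) * target + tau * main
        steps += 1
    return steps

fast = steps_to_converge(tau=0.5)   # large ratio: quick tracking
slow = steps_to_converge(tau=0.01)  # small ratio: smoother, but far slower
```

The gap is stark: with `tau=0.5` the target converges in a handful of steps, while `tau=0.01` needs hundreds, which is the slow-learning cost that motivates noise-robust rules such as T-soft update.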