2016
DOI: 10.1609/aaai.v30i1.10295
Deep Reinforcement Learning with Double Q-Learning

Abstract: The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether, in practice, such overestimations are common, whether they harm performance, and whether they can generally be prevented. In this paper, we answer all these questions affirmatively. In particular, we first show that the recent DQN algorithm, which combines Q-learning with a deep neural network, suffers from substantial overestimations in some games in the Atari 2600 domain. …
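For readers skimming the abstract, the change the paper proposes can be summarized by contrasting the two learning targets, written here in the paper's notation (online parameters \theta_t, target-network parameters \theta_t^-):

```latex
% Standard DQN target: the target network both selects and evaluates the action,
% which is the source of the overestimation bias.
Y_t^{\mathrm{DQN}} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a;\, \theta_t^-)

% Double DQN target: the online network selects the action, the target network
% evaluates it, decoupling action selection from action evaluation.
Y_t^{\mathrm{DoubleDQN}} = R_{t+1} + \gamma\, Q\!\big(S_{t+1}, \operatorname*{arg\,max}_a Q(S_{t+1}, a;\, \theta_t);\, \theta_t^-\big)
```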

Cited by 2,865 publications (924 citation statements).
References 8 publications.
“…Moreover, |Ac| is the size of the action space. To make our RL agent more robust, to stabilize learning, and to handle the problem of overestimation of Q-values, a double Q-network [46] and fixed Q-targets [47] were also incorporated, where TD is the temporal difference and the target network is a second dueling DQN whose parameters were held fixed and copied from the online dueling DQN every m steps (m = 20). To update the parameters of the dueling DQN, as shown in Figure 7, we trained our RL agent by minimizing the loss function, where E is the expectation.…”
Section: Methods (mentioning)
Confidence: 99%
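The statement above describes a dueling DQN paired with a fixed Q-target copy synced every m = 20 steps. Below is a minimal PyTorch sketch of that setup; the layer sizes, state and action dimensions, and the sync wrapper are illustrative assumptions rather than details of the cited work (only m = 20 is taken from the text).

```python
# Minimal sketch: a dueling Q-network plus a frozen target copy re-synced every
# m = 20 training steps (the "fixed Q-targets" trick). Sizes are placeholders.
import copy
import torch
import torch.nn as nn


class DuelingDQN(nn.Module):
    def __init__(self, state_dim: int = 8, n_actions: int = 4, hidden: int = 128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state-value stream V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # advantage stream A(s, a)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        h = self.feature(s)
        v, a = self.value(h), self.advantage(h)
        # combine streams: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)


online = DuelingDQN()
target = copy.deepcopy(online)   # fixed Q-target network; never trained directly


def maybe_sync_target(step: int, m: int = 20) -> None:
    # copy the online parameters into the frozen target every m steps
    if step % m == 0:
        target.load_state_dict(online.state_dict())
```

The TD target and loss then follow the Double DQN form shown after the abstract, with the frozen copy playing the role of \theta^-.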
“…To combine FBDD with our RL framework, we first collected and built a SARS-CoV-2 3CLpro inhibitor dataset containing 284 reported molecules. We adopted an improved BRICS algorithm [46] to split these molecules and obtain a fragment library targeting SARS-CoV-2 3CLpro, as demonstrated in the flowchart in Figure 1 (yellow box). An elaborate filtering cascade was accompanied by manual inspection, and the rules can be changed to suit the needs of different studies.…”
Section: Methods (mentioning)
Confidence: 99%
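As a rough illustration of the fragmentation step mentioned above, the sketch below uses RDKit's standard BRICS decomposition on a placeholder molecule; the cited work uses an improved BRICS variant and its own 284-compound dataset, neither of which is reproduced here.

```python
# Sketch of BRICS-based fragmentation with RDKit: split input molecules into
# fragments to seed a fragment library. The example SMILES is a placeholder.
from rdkit import Chem
from rdkit.Chem import BRICS

smiles_library = ["CC(=O)Nc1ccc(O)cc1"]   # placeholder input, not the reported inhibitor set

fragments = set()
for smi in smiles_library:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:          # basic sanity filter; real pipelines add more rules
        continue
    fragments.update(BRICS.BRICSDecompose(mol))

# fragment SMILES carry BRICS attachment points written as dummy atoms, e.g. [1*]
print(sorted(fragments))
```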
“…A deep Q-network approximates the Q-function with a neural network. In this article, we specifically used double DQN (Van Hasselt et al., 2016), in which a target network is used to compute the loss between the current and desired predictions of the Q-values. This loss is then used to update the weights of the neural network representing the agent.…”
Section: Knowledge-guided Reinforcement Learning (mentioning)
Confidence: 99%
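The update this citation describes can be sketched as follows: the target network supplies the desired Q-value, the online network the current prediction, and their squared error drives the weight update. The network shape, optimizer, and discount factor below are assumptions for the example, not details of the cited article.

```python
# Sketch of one Double DQN update step with a frozen target network.
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
q_target = copy.deepcopy(q_net)                           # frozen copy of the agent network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)


def double_dqn_update(s, a, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        # online network selects the next action, target network evaluates it
        a_next = q_net(s_next).argmax(dim=1, keepdim=True)
        y = r + gamma * (1.0 - done) * q_target(s_next).gather(1, a_next).squeeze(1)
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # current prediction Q(s, a)
    loss = nn.functional.mse_loss(q_pred, y)                # loss between current and desired values
    optimizer.zero_grad()
    loss.backward()                                         # gradients flow only through q_net
    optimizer.step()
    return loss.item()
```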
“…RL usually consists of value-based methods and policy-based methods. Value-based methods approximate the value function with tables or neural networks, as in DQN [17], Dueling-DQN [18], and Double DQN [19]. Tabular value-based Q-learning, however, explodes as the dimensionality of the state space increases.…”
Section: Introduction (mentioning)
Confidence: 99%
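To make the scaling point concrete, here is a tiny sketch of the tabular value-based approach: the Q-table holds one entry per (state, action) pair, so its size grows exponentially with the number of state variables, which is what motivates the DQN-family function approximators named above. The learning rate, discount, and action count are illustrative assumptions.

```python
# Tabular Q-learning update: one table entry per (state, action) pair.
from collections import defaultdict

Q = defaultdict(float)                 # maps (state, action) -> estimated value
alpha, gamma, n_actions = 0.1, 0.99, 4


def q_learning_update(s, a, r, s_next):
    best_next = max(Q[(s_next, a2)] for a2 in range(n_actions))
    # classic target: r + gamma * max_a' Q(s', a')
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# With d discrete state variables of k values each the table needs k**d rows,
# so neural-network approximators (DQN, Dueling-DQN, Double DQN) replace it
# when the state space is large.
```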