Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence 2019
DOI: 10.24963/ijcai.2019/653

Multiple Policy Value Monte Carlo Tree Search

Abstract: Many of the strongest game playing programs use a combination of Monte Carlo tree search (MCTS) and deep neural networks (DNN), where the DNNs are used as policy or value evaluators. Given a limited budget, such as online playing or during the self-play phase of AlphaZero (AZ) training, a balance needs to be reached between accurate state estimation and more MCTS simulations, both of which are critical for a strong game playing agent. Typically, larger DNNs are better at generalization and accurate evaluation, …
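The trade-off sketched in the abstract, between evaluation quality and simulation count under a fixed budget, can be made concrete with a small back-of-the-envelope calculation; the per-inference costs below are hypothetical, not figures from the paper.

```python
# Illustration of the budget trade-off described in the abstract:
# under a fixed compute budget, a cheaper (smaller) DNN evaluator
# affords more MCTS simulations than a costlier (larger) one.
# All costs below are assumed, illustrative numbers.

def simulations_under_budget(budget_ms: float, eval_cost_ms: float) -> int:
    """Simulations affordable if each one performs a single DNN evaluation."""
    return int(budget_ms // eval_cost_ms)

SMALL_NET_MS = 1.0   # hypothetical per-inference cost of a small net
LARGE_NET_MS = 10.0  # hypothetical per-inference cost of a large net

budget = 1000.0  # one second of search per move
print(simulations_under_budget(budget, SMALL_NET_MS))  # -> 1000
print(simulations_under_budget(budget, LARGE_NET_MS))  # -> 100
```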

Cited by 5 publications (3 citation statements)
References 3 publications
“…On-policy learning is the simplest and avoids many difficulties associated with off-policy learning, such as the ‘deadly triad’, where the combination of function approximation, off-policy learning and bootstrapping can result in divergent behaviour [22]. Despite this caveat, off-policy learning may improve the learning process by decoupling exploration from the value estimate, and multiple successful reinforcement learning algorithms have used it [10,11].…”
Section: On-policy and Off-policy Learning
confidence: 99%
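As a concrete companion to the quoted distinction, the following textbook-style tabular updates (not taken from the cited paper) show where on-policy and off-policy bootstrapping differ: SARSA backs up the action the behaviour policy actually chose, while Q-learning backs up the greedy action.

```python
# Tabular SARSA (on-policy) vs. Q-learning (off-policy) updates,
# illustrating the decoupling the quoted passage describes. This is
# a generic textbook sketch, not code from the cited paper; Q is a
# dict mapping state -> {action: value}.

ALPHA, GAMMA = 0.1, 0.99  # assumed step size and discount factor

def sarsa_update(Q, s, a, r, s2, a2):
    # On-policy: bootstrap from a2, the action the behaviour policy
    # actually took in s2 (exploration affects the value estimate).
    Q[s][a] += ALPHA * (r + GAMMA * Q[s2][a2] - Q[s][a])

def q_learning_update(Q, s, a, r, s2):
    # Off-policy: bootstrap from the greedy action in s2, regardless
    # of what the (e.g., epsilon-greedy) behaviour policy will do.
    Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2].values()) - Q[s][a])
```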
“…Many of the state-of-the-art AI programs in games such as Go, Atari, shogi, chess, and NoGo (Silver et al. 2016, 2017, 2018; Schrittwieser et al. 2019; Lan et al. 2019) use a combination of deep neural networks (DNNs) (LeCun, Bengio, and Hinton 2015) and Monte Carlo tree search (MCTS) (Browne et al. 2012; Kocsis and Szepesvári 2006). Starting from 2015, DNNs have been used to extract high-level information from a given state and provide high-quality policy and state value estimates for MCTS agents.…”
Section: Introduction
confidence: 99%
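A minimal sketch of how a DNN's policy prior and value estimates typically plug into MCTS selection in AlphaZero-style programs; the node layout and the c_puct constant here are assumptions for illustration, not the cited paper's code.

```python
import math
from dataclasses import dataclass, field

# Sketch of the PUCT selection rule used by AlphaZero-style agents,
# showing how the DNN's policy prior and value estimates feed MCTS.
# Node layout and the C_PUCT constant are illustrative assumptions.

C_PUCT = 1.5

@dataclass
class Node:
    prior: float                  # P(s, a) from the DNN policy head
    visits: int = 0
    value_sum: float = 0.0        # accumulated value-head estimates
    children: dict = field(default_factory=dict)  # action -> Node

def select_action(node: Node):
    """Pick the child maximizing Q(s, a) + U(s, a)."""
    parent_visits = max(1, sum(c.visits for c in node.children.values()))

    def puct(child: Node) -> float:
        q = child.value_sum / child.visits if child.visits else 0.0
        u = C_PUCT * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
        return q + u

    return max(node.children.items(), key=lambda kv: puct(kv[1]))[0]
```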
“…For example, AlphaGo used more than a thousand CPUs and 250 GPUs for playing games against Lee Sedol, not to mention the millions of self-play games required during training for AlphaZero. Many methods have been proposed to reduce the computation of PV-MCTS (Lan et al. 2019; Gao, Müller, and Hayward 2018) or to decrease inference time through parallelization (Liu et al. 2020). However, few of them consider terminating the search once the best action has become stable.…”
Section: Introduction
confidence: 99%
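To make the idea of stopping once the best action is stable concrete, here is one simple criterion, an illustration rather than the cited paper's method: terminate once the runner-up at the root can no longer overtake the leader within the remaining budget.

```python
# One simple early-termination test for MCTS at the root (an
# illustration of the idea, not necessarily the cited paper's
# method): stop once the runner-up can no longer overtake the
# most-visited action within the remaining simulation budget,
# since each simulation adds at most one visit.

def can_terminate(visit_counts, remaining_simulations):
    """visit_counts: per-action visit counts at the root node."""
    ranked = sorted(visit_counts, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best - runner_up > remaining_simulations

# Leader at 600 visits, runner-up at 350, 200 simulations remaining:
print(can_terminate([600, 350, 50], 200))  # True -> stop searching
```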