Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence 2019
DOI: 10.24963/ijcai.2019/653

Multiple Policy Value Monte Carlo Tree Search

Abstract: Many of the strongest game playing programs use a combination of Monte Carlo tree search (MCTS) and deep neural networks (DNN), where the DNNs are used as policy or value evaluators. Given a limited budget, such as online playing or during the self-play phase of AlphaZero (AZ) training, a balance needs to be reached between accurate state estimation and more MCTS simulations, both of which are critical for a strong game playing agent. Typically, larger DNNs are better at generalization and accurate evaluation, …
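The trade-off sketched in the abstract, between evaluation quality and simulation count under a fixed budget, can be made concrete with a small back-of-the-envelope calculation; the per-inference costs below are hypothetical, not figures from the paper.

```python
# Illustration of the budget trade-off described in the abstract:
# under a fixed compute budget, a cheaper (smaller) DNN evaluator
# affords more MCTS simulations than a costlier (larger) one.
# All costs below are assumed, illustrative numbers.

def simulations_under_budget(budget_ms: float, eval_cost_ms: float) -> int:
    """Simulations affordable if each one performs a single DNN evaluation."""
    return int(budget_ms // eval_cost_ms)

SMALL_NET_MS = 1.0   # hypothetical per-inference cost of a small net
LARGE_NET_MS = 10.0  # hypothetical per-inference cost of a large net

budget = 1000.0  # one second of search per move
print(simulations_under_budget(budget, SMALL_NET_MS))  # -> 1000
print(simulations_under_budget(budget, LARGE_NET_MS))  # -> 100
```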

Cited by 5 publications (3 citation statements)
References 3 publications
“…On-policy learning is the simplest and avoids many difficulties associated with off-policy learning, such as the ‘deadly triad’, where the combination of function approximation, off-policy learning and bootstrapping can result in divergent behaviour [22]. Despite this caveat, off-policy learning may improve the learning process by decoupling exploration from the value estimate, and multiple successful reinforcement learning algorithms have used it [10,11].…”
Section: On-policy and Off-policy Learning
confidence: 99%
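As a concrete companion to the quoted distinction, the following textbook-style tabular updates (not taken from the cited paper) show where on-policy and off-policy bootstrapping differ: SARSA backs up the action the behaviour policy actually chose, while Q-learning backs up the greedy action.

```python
# Tabular SARSA (on-policy) vs. Q-learning (off-policy) updates,
# illustrating the decoupling the quoted passage describes. This is
# a generic textbook sketch, not code from the cited paper; Q is a
# dict mapping state -> {action: value}.

ALPHA, GAMMA = 0.1, 0.99  # assumed step size and discount factor

def sarsa_update(Q, s, a, r, s2, a2):
    # On-policy: bootstrap from a2, the action the behaviour policy
    # actually took in s2 (exploration affects the value estimate).
    Q[s][a] += ALPHA * (r + GAMMA * Q[s2][a2] - Q[s][a])

def q_learning_update(Q, s, a, r, s2):
    # Off-policy: bootstrap from the greedy action in s2, regardless
    # of what the (e.g., epsilon-greedy) behaviour policy will do.
    Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2].values()) - Q[s][a])
```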
“…Many of the state-of-the-art AI programs in games such as Go, Atari, shogi, chess, and NoGo (Silver et al. 2016, 2017, 2018; Schrittwieser et al. 2019; Lan et al. 2019) use a combination of deep neural networks (DNNs) (LeCun, Bengio, and Hinton 2015) and Monte Carlo tree search (MCTS) (Browne et al. 2012; Kocsis and Szepesvári 2006). Starting from 2015, DNNs have been used to extract high-level information from a given state and provide high-quality policy and state value estimates for MCTS agents.…”
Section: Introduction
confidence: 99%
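A minimal sketch of how a DNN's policy prior and value estimates typically plug into MCTS selection in AlphaZero-style programs; the node layout and the c_puct constant here are assumptions for illustration, not the cited paper's code.

```python
import math
from dataclasses import dataclass, field

# Sketch of the PUCT selection rule used by AlphaZero-style agents,
# showing how the DNN's policy prior and value estimates feed MCTS.
# Node layout and the C_PUCT constant are illustrative assumptions.

C_PUCT = 1.5

@dataclass
class Node:
    prior: float                  # P(s, a) from the DNN policy head
    visits: int = 0
    value_sum: float = 0.0        # accumulated value-head estimates
    children: dict = field(default_factory=dict)  # action -> Node

def select_action(node: Node):
    """Pick the child maximizing Q(s, a) + U(s, a)."""
    parent_visits = max(1, sum(c.visits for c in node.children.values()))

    def puct(child: Node) -> float:
        q = child.value_sum / child.visits if child.visits else 0.0
        u = C_PUCT * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
        return q + u

    return max(node.children.items(), key=lambda kv: puct(kv[1]))[0]
```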
“…For example, AlphaGo used more than a thousand CPUs and 250 GPUs for playing games against Lee Sedol, not to mention the millions of self-play games required during training for AlphaZero. Many methods have been proposed to reduce the computation of PV-MCTS (Lan et al. 2019; Gao, Müller, and Hayward 2018) or to decrease inference time through parallelization (Liu et al. 2020). However, few of them consider terminating the search once the best action has become stable.…”
Section: Introduction
confidence: 99%
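To make the idea of stopping once the best action is stable concrete, here is one simple criterion, an illustration rather than the cited paper's method: terminate once the runner-up at the root can no longer overtake the leader within the remaining budget.

```python
# One simple early-termination test for MCTS at the root (an
# illustration of the idea, not necessarily the cited paper's
# method): stop once the runner-up can no longer overtake the
# most-visited action within the remaining simulation budget,
# since each simulation adds at most one visit.

def can_terminate(visit_counts, remaining_simulations):
    """visit_counts: per-action visit counts at the root node."""
    ranked = sorted(visit_counts, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best - runner_up > remaining_simulations

# Leader at 600 visits, runner-up at 350, 200 simulations remaining:
print(can_terminate([600, 350, 50], 200))  # True -> stop searching
```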