2022
DOI: 10.48550/arxiv.2201.12403
Preprint

Planning and Learning with Adaptive Lookahead

Abstract: The classical Policy Iteration (PI) algorithm alternates between greedy one-step policy improvement and policy evaluation. Recent literature shows that multi-step lookahead policy improvement yields a better convergence rate at the expense of increased per-iteration complexity. However, one cannot tell before running the algorithm which fixed lookahead horizon is best, and within a given run, a lookahead horizon larger than one is often wasteful. In this work, we propose for the first time …
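To make the abstract's contrast concrete, below is a minimal sketch of policy iteration with a fixed h-step lookahead improvement step on a small tabular MDP. It is an illustration under assumed names (P, R, gamma, h), not the paper's algorithm: the paper's point is precisely that the best fixed h is unknown in advance, which motivates choosing the lookahead adaptively.

```python
# Minimal sketch of PI with a fixed h-step lookahead improvement step.
# All names (P, R, gamma, h) are illustrative assumptions, not the
# paper's notation or code.
import numpy as np

def policy_evaluation(P, R, policy, gamma):
    """Solve V = R_pi + gamma * P_pi V exactly as a linear system."""
    n_states = P.shape[1]
    idx = np.arange(n_states)
    P_pi = P[policy, idx]                     # (S, S) dynamics under pi
    R_pi = R[policy, idx]                     # (S,)  rewards under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

def lookahead_improvement(P, R, V, gamma, h):
    """Return the greedy first action of the h-horizon optimal plan
    rooted at terminal values V (h = 1 recovers classical PI)."""
    V_h = V
    for _ in range(h - 1):                    # h-1 Bellman optimality backups
        V_h = (R + gamma * P @ V_h).max(axis=0)
    Q = R + gamma * P @ V_h                   # (A, S) depth-h action values
    return Q.argmax(axis=0)

def policy_iteration(P, R, gamma=0.9, h=3, max_iters=100):
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iters):
        V = policy_evaluation(P, R, policy, gamma)
        new_policy = lookahead_improvement(P, R, V, gamma, h)
        if np.array_equal(new_policy, policy):
            break                             # improvement step is stable
        policy = new_policy
    return policy

# Tiny random MDP to exercise the sketch.
rng = np.random.default_rng(0)
A, S = 3, 5
P = rng.dirichlet(np.ones(S), size=(A, S))    # (A, S, S), rows sum to 1
R = rng.uniform(size=(A, S))                  # (A, S) rewards
print(policy_iteration(P, R, gamma=0.9, h=3))
```

Setting h = 1 recovers classical PI; larger h trades extra backups per improvement step for faster convergence, which is exactly the tradeoff the abstract describes.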

Cited by 1 publication (1 citation statement)
References 6 publications
“…Instead, in finite action space environments such as Atari, we compute the exact expectation in SoftTreeMax with an exhaustive TS of depth d. Despite the exponential computational cost of spanning the entire tree, recent advancements in parallel GPU-based simulation allow efficient expansion of all nodes at the same depth simultaneously (Dalal et al., 2021; Rosenberg et al., 2022). This is possible when a simulator is implemented on GPU (Dalton et al., 2020; Makoviychuk et al., 2021; Freeman et al., 2021), or when a forward model is learned (Kim et al., 2020; Ha & Schmidhuber, 2018).…”
Section: SoftTreeMax: Deep Parallel Implementation
confidence: 99%
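As a rough illustration of the quoted mechanism, the sketch below expands an exhaustive tree of depth d one level at a time, applying every action to every node at the current depth in a single batched step; this is the pattern that a GPU simulator parallelizes to make exact SoftTreeMax expectations tractable. All names here (P, R, expand_tree) are assumptions for illustration, not the cited papers' implementation.

```python
# Level-by-level exhaustive tree expansion of depth d for a tabular MDP,
# with NumPy on CPU standing in for a GPU simulator. Each node carries a
# distribution over states, so the expansion computes exact expectations
# as in the quote. Illustrative sketch, not the cited implementation.
import numpy as np

def expand_tree(P, R, s0, d, gamma=0.99):
    """Expand all A^d action sequences from state s0, one depth at a time."""
    n_actions, n_states, _ = P.shape
    dist = np.zeros((1, n_states))
    dist[0, s0] = 1.0                          # root: point mass on s0
    returns = np.zeros(1)
    for depth in range(d):
        # Expand every node at this depth by every action at once -- the
        # batched step that maps well onto parallel GPU simulation.
        rewards = dist @ R.T                   # (nodes, A) expected rewards
        children = np.einsum('ns,ast->nat', dist, P)  # (nodes, A, S)
        returns = (returns[:, None] + gamma**depth * rewards).ravel()
        dist = children.reshape(-1, n_states)
    return returns                             # (A^d,) cumulative returns

# SoftTreeMax-style use (illustrative): softmax over depth-d path returns.
rng = np.random.default_rng(0)
A, S = 2, 4
P = rng.dirichlet(np.ones(S), size=(A, S))     # (A, S, S) dynamics
R = rng.uniform(size=(A, S))                   # (A, S) rewards
logits = expand_tree(P, R, s0=0, d=3)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # distribution over 2**3 paths
print(probs)
```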