“…However, in asynchronous reinforcement learning (RL) [Tsitsiklis, 1994, Even-Dar et al., 2003], data is generated along a single Markov chain, precluding the use of stochastic optimization methods. Inspired by resampling-based inference methods in stochastic optimization, bootstrap-based methods have been developed for linear policy evaluation tasks [White and White, 2010, Hanna et al., 2017, Hao et al., 2021, Ramprasad et al., 2021]. However, they are not suitable for nonlinear tasks, such as quantifying randomness in the optimal value function.…”