“…Most of the literature has focused on analyzing the Bayesian regret of TS in general settings such as linear bandits or reinforcement learning (see, e.g., [Osband and Van Roy, 2015]). More recently, [Russo and Van Roy, 2016, Dong and Van Roy, 2018, Dong et al., 2019] provided an information-theoretic analysis of TS, in which the key tool is the information ratio, which quantifies the trade-off between exploration and exploitation. Additionally, [Gopalan and Mannor, 2015] provides regret guarantees for TS in the finite and infinite MDP setting.…”