2020
DOI: 10.48550/arxiv.2006.06790
Preprint

On Worst-case Regret of Linear Thompson Sampling

Abstract: In this paper, we consider the worst-case regret of Linear Thompson Sampling (LinTS) for the linear bandit problem. Russo and Van Roy (2014) show that the Bayesian regret of LinTS is bounded above by O(d√T), where T is the time horizon and d is the number of parameters. While this bound matches the minimax lower bounds for this problem up to logarithmic factors, the existence of a similar worst-case regret bound is still unknown. The only known worst-case regret bound for LinTS, due to Agrawal and Goyal (201…
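For orientation, the LinTS algorithm the abstract refers to can be sketched as below: maintain a ridge-regression estimate of the parameter, sample from a Gaussian centered at it, and play the arm that is best for the sample. This is a minimal illustrative sketch under a simulated linear reward model, not the paper's own code; the function name, the fixed arm set, and the simulation loop are assumptions for the example.

```python
import numpy as np

def lin_ts(contexts, theta_star, T, v=1.0, lam=1.0, rng=None):
    """Minimal Linear Thompson Sampling sketch (illustrative).

    contexts:   (K, d) array of fixed arm feature vectors.
    theta_star: (d,) true parameter, used only to simulate rewards.
    v:          posterior scale (the quantity an inflation factor would multiply).
    Returns the cumulative regret over T rounds.
    """
    rng = np.random.default_rng(rng)
    K, d = contexts.shape
    V = lam * np.eye(d)   # regularized design matrix V_t
    b = np.zeros(d)       # running sum of x_t * r_t
    means = contexts @ theta_star
    regret = 0.0
    for t in range(T):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b  # ridge estimate
        # Sample from the Gaussian "posterior" N(theta_hat, v^2 * V_inv)
        theta_tilde = rng.multivariate_normal(theta_hat, v**2 * V_inv)
        a = int(np.argmax(contexts @ theta_tilde))  # greedy w.r.t. the sample
        x = contexts[a]
        r = x @ theta_star + rng.normal()           # noisy linear reward
        V += np.outer(x, x)
        b += r * x
        regret += means.max() - means[a]
    return regret
```

On a toy instance with standard-basis arms, the cumulative regret stays well below the trivial linear bound, consistent with the square-root-in-T behavior discussed above.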

Cited by 4 publications (4 citation statements)
References 8 publications
“…Comparison with Different Adaptive Approaches It is worth noting that a similar adaptive approach has been considered in the Gaussian model with known variance (Jin et al. 2021) and linear models (Hamidi and Bayati 2020). In these approaches, the posterior distribution was modeled as a Gaussian distribution and an adaptive inflation value ρ_t was introduced to the scale parameter, which effectively flattened the posterior distributions.…”
Section: Thompson Sampling With Truncation
confidence: 99%
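The inflation mechanism this statement describes — multiplying the scale of the Gaussian posterior by an adaptive factor ρ_t to flatten it — can be sketched as follows. The particular decay schedule below is a purely illustrative assumption, not the one used in the cited works.

```python
import numpy as np

def sample_inflated_posterior(theta_hat, V_inv, rho_t, rng=None):
    """Draw from the flattened posterior N(theta_hat, rho_t^2 * V_inv).

    rho_t > 1 widens the Gaussian relative to the exact posterior,
    so the resulting Thompson sample explores more aggressively.
    """
    rng = np.random.default_rng(rng)
    return rng.multivariate_normal(theta_hat, rho_t**2 * V_inv)

def rho_schedule(t, rho0=2.0):
    # Hypothetical schedule for illustration only: inflate early,
    # decay toward 1 (the exact posterior) as rounds accumulate.
    return 1.0 + (rho0 - 1.0) / np.sqrt(t + 1)
```

The point of making ρ_t adaptive, per the quote, is that a fixed inflation wastes exploration once the estimate is accurate, whereas a data- or time-dependent ρ_t can back off.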
“…Remark 3.2. As shown in Hamidi and Bayati (2020), the assumption that LinTS uses the true posterior distribution for Θ is crucial, as the Bayesian regret of LinTS can grow linearly for exp(Cd) rounds for some constant C > 0.…”
Section: Linear Thompson Sampling
confidence: 99%
“…Accordingly, the study of theoretical performance guarantees for Thompson sampling has gained much popularity and made significant progress in the recent literature, with an emphasis on high-probability instance-dependent regret. First, regret bounds growing as the square root of time were shown for adversarial contextual bandits (Agrawal and Goyal, 2013; Russo and Van Roy, 2014; Abeille and Lazaric, 2017), succeeded by a square-root regret bound for settings with a Euclidean action set (Hamidi and Bayati, 2020) and a logarithmic regret bound for stochastic contextual bandits with a shared reward parameter (Chakraborty et al., 2023). In particular, in the latter case (where the rewards of different arms share the unknown parameter), the regret of Thompson sampling can still be logarithmic in time if the observations are noisy versions of the stochastic context vectors and of the same dimension (Faradonbeh, 2021, 2022a).…”
Section: Introduction
confidence: 99%