2013 Asilomar Conference on Signals, Systems and Computers
DOI: 10.1109/acssc.2013.6810607

Achieving complete learning in Multi-Armed Bandit problems

Cited by 7 publications (9 citation statements)
References 6 publications
“…We design algorithms that exploit the prior information in all objectives simultaneously to rule out arms that are not lexicographic optimal. Our regret bounds match the ones in Garivier et al (2018) and improve the ones in Vakili and Zhao (2013) for the case with a single objective.…”
Section: Related Work (supporting, confidence: 62%)
“…Bubeck and Liu (2013) considers Thompson sampling and shows that its regret is uniformly bounded when μ* and a positive lower bound on Δ are known. On the other hand, Vakili and Zhao (2013) considers a weaker prior information model where the learner knows a near-optimal expected reward η, which can be computed using μ* and a positive lower bound on Δ. The proposed algorithm obtains ∑_a Δ_a/δ³ regret, where δ = μ* − η < Δ and Δ_a is the suboptimality gap of arm a.…”
Section: Related Work (mentioning, confidence: 99%)
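
The prior-information model quoted above lends itself to a short illustration. The Python sketch below is a minimal sketch under assumptions, not the actual algorithm of Vakili and Zhao (2013): an arm is dropped once its upper confidence bound falls below the known near-optimal reward η, so each suboptimal arm (whose mean lies below η) is pulled only finitely often and total regret stays bounded. The function name, Hoeffding-style bound, and parameter choices are illustrative.

    import numpy as np

    def eta_elimination(arms, eta, horizon, conf=0.01):
        """Sketch of a bandit rule exploiting a known near-optimal reward eta.

        `arms` is a list of callables returning rewards in [0, 1]; every
        suboptimal arm is assumed to have mean below eta while the best
        arm's mean lies above it.  An arm is dropped for good once its
        Hoeffding upper confidence bound falls below eta.
        """
        k = len(arms)
        counts, sums = np.zeros(k), np.zeros(k)
        active = set(range(k))
        total = 0.0
        t = 0
        while t < horizon:
            for a in list(active):
                if t >= horizon:
                    break
                r = arms[a]()
                counts[a] += 1
                sums[a] += r
                total += r
                t += 1
                ucb = sums[a] / counts[a] + np.sqrt(np.log(1.0 / conf) / (2.0 * counts[a]))
                if ucb < eta and len(active) > 1:
                    active.discard(a)
        return total, active

    # Example: Bernoulli arms with means (0.9, 0.5, 0.4) and eta = 0.8;
    # the two suboptimal arms are typically eliminated within a few dozen pulls.
    rng = np.random.default_rng(0)
    arms = [lambda m=m: float(rng.random() < m) for m in (0.9, 0.5, 0.4)]
    total, active = eta_elimination(arms, eta=0.8, horizon=2000)
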
“…There are standard techniques for such extensions by replacing the concentration result with the corresponding ones for light-tailed and heavy-tailed distributions (the latter also requires replacing sample means with truncated sample means). Similar extensions for classic MAB problems without side information are discussed in [4], [34]. To illuminate the main ideas without too much technicality, most existing work makes the even stronger assumption of bounded support in [0, 1] (see [2], [3], [24], etc.).…”
Section: Extensions To Other Distributions (mentioning, confidence: 99%)
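
As a hedged illustration of the truncated-sample-mean device mentioned above, the sketch below zeroes out samples that exceed an index-dependent threshold, assuming a known bound u on the (1 + ε)-th raw moment of the reward distribution. The threshold form follows the style of the heavy-tailed bandit literature; the function name and parameters are assumptions, not the exact construction used in the cited references.

    import numpy as np

    def truncated_mean(samples, u, eps, delta):
        """Truncated empirical mean for heavy-tailed rewards (sketch).

        Assumes a known bound u on the (1 + eps)-th raw moment,
        E[|X|^(1 + eps)] <= u.  The i-th sample is kept only when
        |X_i| <= (u * i / log(1/delta)) ** (1 / (1 + eps)); larger
        samples are zeroed out, trading a small bias for much lighter
        tails in the estimator.
        """
        x = np.asarray(samples, dtype=float)
        i = np.arange(1, len(x) + 1)
        threshold = (u * i / np.log(1.0 / delta)) ** (1.0 / (1.0 + eps))
        return float(np.where(np.abs(x) <= threshold, x, 0.0).sum() / len(x))
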
“…The challenges described above motivate us to focus on the cases when the learner has prior knowledge on expected rewards. Specifically, we consider two types of prior knowledge, which generalize the prior knowledge introduced in [8] and [9] to multidimensional rewards. In the first case, we assume that the expected rewards of a lexicographic optimal arm are known.…”
Section: Introduction (mentioning, confidence: 99%)
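
The first type of prior knowledge described above admits a simple elimination rule, sketched below under assumptions (this is illustrative, not the cited paper's algorithm): since a lexicographic optimal arm must attain the known expected reward μ*_j in every objective j, an arm whose upper confidence bound falls below μ*_j for some j can be ruled out.

    import numpy as np

    def rule_out(sums, counts, mu_star, conf=0.01):
        """Sketch: drop arms that cannot be lexicographic optimal.

        `sums` is a (num_arms, num_objectives) array of per-objective
        reward sums, `counts` the pulls per arm, and `mu_star` the known
        expected rewards of a lexicographic optimal arm.  Any
        lexicographic optimal arm must match mu_star in every objective,
        so an arm whose upper confidence bound drops below mu_star in
        any objective is eliminated.  Returns a boolean mask of arms
        that remain plausible.
        """
        n = np.maximum(counts, 1).astype(float)[:, None]
        ucb = sums / n + np.sqrt(np.log(1.0 / conf) / (2.0 * n))
        return np.all(ucb >= mu_star, axis=1)
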