Combinatorial Pure Exploration with Continuous and Separable Reward Functions and Its Applications

Huang, Weiran; Ok, Jungseul; Li, Liang; Wei, Chen

doi:10.24963/ijcai.2018/317

Cited by 56 publications

(13 citation statements)

References 15 publications


“…Moreover, even when $\Delta_2$ is large, the sample complexity depends on the maximum over $\frac{p_k}{W^2}$ and $\frac{1-p_k}{\max(W,\Delta_k)^2}$, and hence $W$ primarily determines the sample complexity, as can be seen in the order notation above. This also explains why we do better than the pure super-arm exploration algorithm COCI (Huang et al 2018) in experiments.…”

Section: SAUCB Algorithm (mentioning)

confidence: 67%

“…Unlike Hoeffding, the lil bound is time-uniform; that is, the lil bound holds for all timesteps (avoiding a naive union bound over time). While a number of other time-uniform concentration bounds exist in the literature (Huang et al 2018; Zhao et al 2016), in practice, the Hoeffding bound works much better for us than the lil bound (see experiments). Thus, we limit ourselves to just the Hoeffding bound and lil bound.…”

Section: Variants (mentioning)

confidence: 92%
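
The contrast drawn in the statement above, between a fixed-sample Hoeffding bound and a bound that holds at every timestep, can be illustrated with a minimal sketch. It assumes rewards bounded in [0, 1]; the function names and the per-step budget delta_n = 6*delta / (pi^2 * n^2) are illustrative choices, not taken from the cited papers. The lil bound itself is not implemented here, since its constants differ between references.

```python
import math


def hoeffding_radius(n, delta):
    """Two-sided Hoeffding confidence radius for the mean of n samples in [0, 1].

    Valid for one fixed sample size n with failure probability at most delta.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def union_bound_radius(n, delta):
    """Radius that holds simultaneously for all n >= 1 via a naive union bound.

    Assumption: the failure budget at step n is 6 * delta / (pi^2 * n^2),
    which sums to delta over n = 1, 2, ... (hypothetical but standard choice).
    """
    delta_n = 6.0 * delta / (math.pi ** 2 * n ** 2)
    return hoeffding_radius(n, delta_n)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(hoeffding_radius(n, 0.05), 4), round(union_bound_radius(n, 0.05), 4))
```

The union-bound radius is always the wider of the two, which is the price of holding at every timestep; time-uniform bounds such as the lil bound aim to reduce that price.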

“…Therefore, for a fair comparison, we allow all the baseline algorithms to use Hoeffding bounds on the arms or super-arms. The baselines we compare against are (a) naive uniform: arms are sampled in a uniform distribution, (b) COCI: prior work by (Huang et al 2018) which is a pure exploration algorithm for super-arms with weighted rewards, (c) UAS: modified SAUCB where the arms within a superarm are not selected using the derivative values but aiming for every arm in the super-arm to be sampled equally often (that is, a uniform distribution within the super-arm), and (d) Modified SE: the simple approach that combines successive elimination (Even-Dar, Mannor, and Mansour 2006) and mixed-strategy sampling as described earlier.…”

Section: Methods (mentioning)

confidence: 99%
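
For readers unfamiliar with the successive elimination baseline named in the statement above, the following is a rough sketch of the plain algorithm for single arms with rewards in [0, 1]. It is not the "Modified SE" variant from the citing paper, which also involves mixed-strategy sampling. The `pull` callable, the `max_rounds` cap, and the specific confidence radius are assumptions made for illustration.

```python
import math


def successive_elimination(pull, num_arms, delta, max_rounds=100000):
    """Return (estimated best arm, rounds used) by successive elimination.

    pull(arm) must return a reward in [0, 1]. Every active arm is sampled
    once per round; an arm is dropped once its upper confidence bound falls
    below the lower confidence bound of the empirically best arm.
    """
    if num_arms <= 0 or max_rounds <= 0:
        raise ValueError("num_arms and max_rounds must be positive")
    active = list(range(num_arms))
    sums = [0.0] * num_arms
    for t in range(1, max_rounds + 1):
        for arm in active:
            sums[arm] += pull(arm)
        # Assumed radius: Hoeffding with a union bound over arms and rounds.
        radius = math.sqrt(math.log(4.0 * num_arms * t * t / delta) / (2.0 * t))
        best_mean = max(sums[arm] / t for arm in active)
        active = [arm for arm in active if sums[arm] / t + radius >= best_mean - radius]
        if len(active) == 1:
            return active[0], t
    # Budget exhausted: fall back to the empirically best surviving arm.
    return max(active, key=lambda arm: sums[arm]), max_rounds


if __name__ == "__main__":
    means = [0.2, 0.5, 0.9]
    # Deterministic rewards keep the demonstration reproducible.
    print(successive_elimination(lambda arm: means[arm], len(means), delta=0.05))
```

Replacing the deterministic `pull` with a Bernoulli sampler gives the usual stochastic setting; the number of rounds then grows roughly with the inverse squared gap between the best and second-best arm.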

“…While this problem's goal is to identify the best super-arm (defined as a subset of single arms with additive rewards), our objective still differs as the values of our super-arms are weighted combinations of rewards of single arms. A recent work handles weighted combinations of rewards (Huang et al 2018) in an algorithm called COCI which can be used for our problem, but COCI still aims to identify the best super-arm and not bound the highest reward in an interval. We compare to COCI in experiments.…”

Section: Related Work (mentioning)

confidence: 99%

“…Our comparisons are for both synthetic data and for a large-scale example based on a well-known agent-based simulator of stock markets that has been used in recent papers in AI venues (Wang, Vorobeychik, and Wellman 2018). Among potential approaches in the literature, a recent work applies to our setting (Huang et al 2018) (called COCI here based on the algorithm name); however, that work does not aim for the goal mentioned in (b) above. Simple approaches combining pure arm exploration and sampling from the mixed strategy also perform poorly.…”

Section: Introduction (mentioning)

confidence: 99%


Bounding Regret in Empirical Games

Steven, Sinha, et al. 2020. AAAI

Empirical game-theoretic analysis refers to a set of models and techniques for solving large-scale games. However, there is a lack of a quantitative guarantee about the quality of output approximate Nash equilibria (NE). A natural quantitative guarantee for such an approximate NE is the regret in the game (i.e. the best deviation gain). We formulate this deviation gain computation as a multi-armed bandit problem, with a new optimization goal unlike those studied in prior work. We propose an efficient algorithm Super-Arm UCB (SAUCB) for the problem and a number of variants. We present sample complexity results as well as extensive experiments that show the better performance of SAUCB compared to several baselines.
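
The notion of regret as the best deviation gain, mentioned in the abstract above, can be made concrete with a small sketch for a two-player game. It assumes the payoff matrices are known exactly, whereas the paper's setting estimates payoffs by sampling; the function names and the matching-pennies example are illustrative and not taken from the paper.

```python
def expected_payoff(payoff, x, y):
    """Expected payoff when the row player mixes with x and the column player with y."""
    return sum(x[i] * payoff[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def deviation_gain(payoff_row, payoff_col, x, y):
    """Regret of the profile (x, y): the largest gain any single player can
    obtain by unilaterally switching to a pure strategy."""
    rows, cols = len(x), len(y)
    # Row player's best pure response against the column mix y.
    best_row = max(sum(payoff_row[i][j] * y[j] for j in range(cols)) for i in range(rows))
    # Column player's best pure response against the row mix x.
    best_col = max(sum(x[i] * payoff_col[i][j] for i in range(rows)) for j in range(cols))
    gain_row = best_row - expected_payoff(payoff_row, x, y)
    gain_col = best_col - expected_payoff(payoff_col, x, y)
    return max(gain_row, gain_col)


if __name__ == "__main__":
    # Matching pennies: zero-sum, unique equilibrium at uniform mixing.
    row = [[1.0, -1.0], [-1.0, 1.0]]
    col = [[-1.0, 1.0], [1.0, -1.0]]
    print(deviation_gain(row, col, [0.5, 0.5], [0.5, 0.5]))
    print(deviation_gain(row, col, [1.0, 0.0], [1.0, 0.0]))
```

A profile is an exact Nash equilibrium precisely when this quantity is zero, so bounding it gives a quantitative guarantee on how far an approximate equilibrium is from being exact.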

Online learning for route planning with on-time arrival reliability

Jiang, Samaranayake, Zhao. 2023. Operations Research Letters

Bandit Algorithms

2020


Bandit problems were introduced by William R. Thompson in an article published in 1933 in Biometrika. Thompson was interested in medical trials and the cruelty of running a trial blindly, without adapting the treatment allocations on the fly as the drug appears more or less effective. The name comes from the 1950s, when Frederick Mosteller and Robert Bush decided to study animal learning and ran trials on mice and then on humans. The mice faced the dilemma of choosing to go left or right after starting in the bottom of a T-shaped maze, not knowing each time at which end they would find food. To study a similar learning setting in humans, a 'two-armed bandit' machine was commissioned where humans could choose to pull either the left or the right arm of the machine, each giving a random pay-off with the distribution of pay-offs for each arm unknown to the human player. The machine was called a 'two-armed bandit' in homage to the one-armed bandit, an old-fashioned name for a lever-operated slot machine ('bandit' because they steal your money).

[Figure 1.1: Mouse learning a T-maze.]

There are many reasons to care about bandit problems. Decision-making with uncertainty is a challenge we all face, and bandits provide a simple model of this dilemma. Bandit problems also have practical applications. We already mentioned clinical trial design, which researchers have used to motivate their work for 80 years. We can't point to an example where bandits have actually been used in clinical trials, but adaptive experimental design is gaining popularity and is actively encouraged by the US Food and Drug Administration, with the justification that not doing so can lead to the withholding of effective drugs until long after a positive effect has been established.

While clinical trials are an important application for the future, there are applications where bandit algorithms are already in use. Major tech companies use bandit algorithms for configuring web interfaces, where applications include news recommendation, dynamic pricing and ad placement. A bandit algorithm plays a role in Monte Carlo Tree Search, an algorithm made famous by the recent success of AlphaGo.

Finally, the mathematical formulation of bandit problems leads to a rich structure with connections to other branches of mathematics. In writing this book (and previous papers), we have read books on convex analysis/optimisation, Brownian motion, probability theory,
