Introduction to Multi-Armed Bandits

Slivkins, Aleksandrs

doi:10.1561/2200000068

Cited by 334 publications

(116 citation statements)

References 27 publications

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…In fact, in many circumstances, it seems rather prudent to assume that information about outcome values and probabilities are shaped by past encounters of the same decision problem. Experimentally, this configuration is often translated into multi-armed bandit problems (starting with Thompson [59], but see [60] for a review), where the decisionmaker faces abstract cues of unknown value and has to figure by trial-and-error the value of the options. Computationally, behaviour in multi-armed bandit problems is generally wellcaptured by associative or reinforcement learning processes Wu & Gonzalez [54] De Martino et al [25] Pessiglione et al [56] Fiorillo et al [53] Platt & Glimcher [52] Figure 2.…”

Section: The Experience-description Gapmentioning

confidence: 99%

The description–experience gap: a challenge for the neuroeconomics of decision-making under uncertainty

Garcia

Cerrotti

Palminteri

2021

Phil. Trans. R. Soc. B

View full text Add to dashboard Cite

The experimental investigation of decision-making in humans relies on two distinct types of paradigms, involving either description- or experience-based choices. In description-based paradigms, decision variables (i.e. payoffs and probabilities) are explicitly communicated by means of symbols. In experience-based paradigms decision variables are learnt from trial-by-trial feedback. In the decision-making literature, ‘description–experience gap’ refers to the fact that different biases are observed in the two experimental paradigms. Remarkably, well-documented biases of description-based choices, such as under-weighting of rare events and loss aversion, do not apply to experience-based decisions. Here, we argue that the description–experience gap represents a major challenge, not only to current decision theories, but also to the neuroeconomics research framework, which relies heavily on the translation of neurophysiological findings between human and non-human primate research. In fact, most non-human primate neurophysiological research relies on behavioural designs that share features of both description- and experience-based choices. As a consequence, it is unclear whether the neural mechanisms built from non-human primate electrophysiology should be linked to description-based or experience-based decision-making processes. The picture is further complicated by additional methodological gaps between human and non-human primate neuroscience research. After analysing these methodological challenges, we conclude proposing new lines of research to address them. This article is part of the theme issue ‘Existence and prevalence of economic behaviours among non-human primates’.

show abstract

Section: The Experience-description Gapmentioning

confidence: 99%

The description–experience gap: a challenge for the neuroeconomics of decision-making under uncertainty

Garcia

Cerrotti

Palminteri

2021

Phil. Trans. R. Soc. B

View full text Add to dashboard Cite

show abstract

“…The MAB problem is a purely online ML, in which the player strives to gain the maximum reward from multiple arms of slot machines [27,39]. Precisely, the MAB problem aims to detect and select, through finite trials, the arm that maximizes the long-term reward.…”

Section: General Single Player Mab Strategymentioning

confidence: 99%

Gateway Selection in Millimeter Wave UAV Wireless Networks Using Multi-Player Multi-Armed Bandit

Mohamed

Hashima

Aldosary

et al. 2020

Sensors

View full text Add to dashboard Cite

Recently, unmanned aerial vehicle (UAV)-based communications gained a lot of attention due to their numerous applications, especially in rescue services in post-disaster areas where the terrestrial network is wholly malfunctioned. Multiple access/gateway UAVs are distributed to fully cover the post-disaster area as flying base stations to provide communication coverage, collect valuable information, disseminate essential instructions, etc. The access UAVs after gathering/broadcasting the necessary information should select and fly towards one of the surrounding gateways for relaying their information. In this paper, the gateway UAV selection problem is addressed. The main aim is to maximize the long-term average data rates of the UAVs relays while minimizing the flights’ battery cost, where millimeter wave links, i.e., using 30~300 GHz band, employing antenna beamforming, are used for backhauling. A tool of machine learning (ML) is exploited to address the problem as a budget-constrained multi-player multi-armed bandit (MAB) problem. In this setup, access UAVs act as the players, and the arms are the gateway UAVs, while the rewards are the average data rates of the constructed relays constrained by the battery cost of the access UAV flights. In this decentralized setting, where information is neither prior available nor exchanged among UAVs, a selfish and concurrent multi-player MAB strategy is suggested. Towards this end, three battery-aware MAB (BA-MAB) algorithms, namely upper confidence bound (UCB), Thompson sampling (TS), and the exponential weight algorithm for exploration and exploitation (EXP3), are proposed to realize gateways selection efficiently. The proposed BA-MAB-based gateway UAV selection algorithms show superior performance over approaches based on near and random selections in terms of total system rate and energy efficiency.

show abstract

“…Multi-Armed Bandit (MAB) is a powerful framework that allows agents to solve sequential decision making problems under uncertainty [ 16 ]. In the standard version, an algorithm has K possible actions (or arms) to choose from and T rounds (or time-steps).…”

Section: Introductionmentioning

confidence: 99%

Non Stationary Multi-Armed Bandit: Empirical Evaluation of a New Concept Drift-Aware Algorithm

Cavenaghi

Sottocornola

Stella

et al. 2021

Entropy

View full text Add to dashboard Cite

The Multi-Armed Bandit (MAB) problem has been extensively studied in order to address real-world challenges related to sequential decision making. In this setting, an agent selects the best action to be performed at time-step t, based on the past rewards received by the environment. This formulation implicitly assumes that the expected payoff for each action is kept stationary by the environment through time. Nevertheless, in many real-world applications this assumption does not hold and the agent has to face a non-stationary environment, that is, with a changing reward distribution. Thus, we present a new MAB algorithm, named f-Discounted-Sliding-Window Thompson Sampling (f-dsw TS), for non-stationary environments, that is, when the data streaming is affected by concept drift. The f-dsw TS algorithm is based on Thompson Sampling (TS) and exploits a discount factor on the reward history and an arm-related sliding window to contrast concept drift in non-stationary environments. We investigate how to combine these two sources of information, namely the discount factor and the sliding window, by means of an aggregation function f(.). In particular, we proposed a pessimistic (f=min), an optimistic (f=max), as well as an averaged (f=mean) version of the f-dsw TS algorithm. A rich set of numerical experiments is performed to evaluate the f-dsw TS algorithm compared to both stationary and non-stationary state-of-the-art TS baselines. We exploited synthetic environments (both randomly-generated and controlled) to test the MAB algorithms under different types of drift, that is, sudden/abrupt, incremental, gradual and increasing/decreasing drift. Furthermore, we adapt four real-world active learning tasks to our framework—a prediction task on crimes in the city of Baltimore, a classification task on insects species, a recommendation task on local web-news, and a time-series analysis on microbial organisms in the tropical air ecosystem. The f-dsw TS approach emerges as the best performing MAB algorithm. At least one of the versions of f-dsw TS performs better than the baselines in synthetic environments, proving the robustness of f-dsw TS under different concept drift types. Moreover, the pessimistic version (f=min) results as the most effective in all real-world tasks.

show abstract

Introduction to Multi-Armed Bandits

Abstract: Also available as a combined paper and online subscription.

Cited by 334 publications

References 27 publications

The description–experience gap: a challenge for the neuroeconomics of decision-making under uncertainty

The description–experience gap: a challenge for the neuroeconomics of decision-making under uncertainty

Gateway Selection in Millimeter Wave UAV Wireless Networks Using Multi-Player Multi-Armed Bandit

Non Stationary Multi-Armed Bandit: Empirical Evaluation of a New Concept Drift-Aware Algorithm

Contact Info

Product

Resources

About