We consider the problem of stochastic K-armed dueling bandits in the contextual setting, where at each round the learner is presented with a context set of K items, each represented by a d-dimensional feature vector, and the goal of the learner is to identify the best arm of each context set. However, unlike the classical contextual bandit setup, our framework only allows the learner to receive feedback in terms of (noisy) pairwise preferences between items, famously studied as dueling bandits. This is of practical interest in various online decision-making scenarios, e.g. recommender systems, information retrieval, and tournament ranking, where it is easier to elicit the relative strengths of items than their absolute scores. To the best of our knowledge, this work is the first to consider regret minimization for contextual dueling bandits over potentially infinite decision spaces, and it gives provably optimal algorithms along with a matching lower-bound analysis. We present two algorithms for this setup with regret guarantees of Õ(d√T) and Õ(√(dT log K)), respectively. Subsequently, we show that Ω(√(dT)) is the fundamental performance limit for this problem, implying the optimality of our second algorithm. The analysis of our first algorithm is, however, comparatively simpler, and it often outperforms the second empirically. Finally, we corroborate all the theoretical results with suitable experiments.
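To make the feedback model concrete, here is a minimal sketch of one round of contextual pairwise-preference feedback. The latent parameter theta and the logistic (Bradley-Terry style) link on the utility gap are illustrative assumptions for the simulation, not the abstract's exact model specification.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 5, 10
theta = rng.normal(size=d)          # hypothetical latent utility parameter

def duel(X, i, j):
    """Sample 1-bit preference feedback: 1 if item i beats item j.

    Assumes a logistic link on the linear utility gap, one common
    choice for preference models; the paper's link may differ.
    """
    p = 1.0 / (1.0 + np.exp(-theta @ (X[i] - X[j])))
    return rng.binomial(1, p)

X = rng.normal(size=(K, d))         # one context set: K items, d features each
feedback = duel(X, 0, 1)            # the learner only sees this single bit
```

The learner never observes the per-item utilities theta @ X[i], only the binary outcome of each chosen duel.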
We consider the problem of preference-based reinforcement learning (PbRL), where, unlike traditional reinforcement learning, an agent receives feedback only in terms of a 1-bit (0/1) preference over a trajectory pair instead of absolute rewards for them. The success of the traditional RL framework crucially relies on the underlying agent-reward model, which, however, depends on how accurately a system designer can express an appropriate reward function, often a non-trivial task. The main novelty of our framework is the ability to learn from preference-based trajectory feedback, which eliminates the need to hand-craft numeric reward models. This paper sets up a formal framework for the PbRL problem with non-Markovian rewards, where the trajectory preferences are encoded by a generalized linear model of dimension d. Assuming the transition model is known, we then propose an algorithm with an almost optimal regret guarantee of Õ(SHd log(T/δ)√T). We further extend the above algorithm to the case of unknown transition dynamics and provide an algorithm with a near-optimal regret guarantee for that setting as well. To the best of our knowledge, our work is one of the first to give tight regret guarantees for the preference-based RL problem with trajectory preferences.
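A small sketch of the trajectory-preference feedback: trajectories are mapped to a d-dimensional embedding and a 1-bit preference is drawn through a sigmoid link, one member of the generalized linear model class. The parameter theta, the sum-of-features embedding, and the sigmoid link are all illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
theta = rng.normal(size=d)                     # hypothetical preference parameter

def traj_features(traj):
    """Map a trajectory (list of per-step feature vectors) to one
    d-dimensional embedding; a plain sum is a simple non-Markovian choice,
    since it scores the trajectory as a whole rather than per state."""
    return np.sum(traj, axis=0)

def preference(traj_a, traj_b):
    """1-bit feedback: 1 if traj_a is preferred over traj_b, sampled
    through a sigmoid link on the embedded score gap."""
    gap = theta @ (traj_features(traj_a) - traj_features(traj_b))
    return rng.binomial(1, 1.0 / (1.0 + np.exp(-gap)))

H = 6                                          # horizon (episode length)
tau1 = [rng.normal(size=d) for _ in range(H)]
tau2 = [rng.normal(size=d) for _ in range(H)]
fb = preference(tau1, tau2)
```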
We study the problem of dynamic regret minimization in K-armed dueling bandits under non-stationary or time-varying preferences. This is an online learning setup where the agent chooses a pair of items at each round and observes only a relative binary 'win-loss' feedback for this pair, sampled from an underlying preference matrix at that round. We first study the problem of static-regret minimization for adversarial preference sequences and design an efficient algorithm with O(√(KT)) high-probability regret. We next use similar algorithmic ideas to propose an efficient and provably optimal algorithm for dynamic-regret minimization under two notions of non-stationarity. In particular, we establish Õ(√(SKT)) and Õ(V_T^(1/3) K^(1/3) T^(2/3)) dynamic-regret guarantees, with S being the total number of 'effective switches' in the underlying preference relations and V_T being a measure of 'continuous-variation' non-stationarity. The complexity of these problems had not been studied prior to this work, despite the prevalence of non-stationary environments in real-world systems. We justify the optimality of our algorithms by proving matching lower-bound guarantees under both of the above-mentioned notions of non-stationarity. Finally, we corroborate our results with extensive simulations and compare the efficacy of our algorithms against state-of-the-art baselines.
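The per-round feedback and the switching non-stationarity can be sketched as follows: each block between 'effective switches' has its own preference matrix P with P[i, j] = Pr(i beats j), and a duel returns a single Bernoulli bit. The block structure and matrix construction here are illustrative, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(4)
K = 4

def random_pref_matrix(K, rng):
    """A valid preference matrix: P[i, j] = Pr(i beats j), so
    P[i, j] + P[j, i] = 1 and the diagonal is 1/2."""
    U = rng.uniform(0, 1, size=(K, K))
    P = np.triu(U, 1)                 # upper triangle holds P[i, j], i < j
    P = P + np.tril(1 - P.T, -1)      # mirror lower triangle: P[j, i] = 1 - P[i, j]
    np.fill_diagonal(P, 0.5)
    return P

def duel(P, i, j, rng):
    """Binary 'win-loss' feedback: 1 means item i beats item j this round."""
    return rng.binomial(1, P[i, j])

# Piecewise-stationary sequence: one matrix per block between switches.
P_blocks = [random_pref_matrix(K, rng) for _ in range(3)]
outcome = duel(P_blocks[0], 0, 1, rng)
```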
The problem of predicting outcomes of patients in intensive care units (ICUs) is of great importance in critical care medicine, and has wide implications for quality control in ICUs. A dominant approach to this problem has been to use an ICU score system, such as the Acute Physiology and Chronic Health Evaluation (APACHE) system or the Simplified Acute Physiology Score (SAPS) system, to compute a certain severity score for a patient from a set of clinical observations, and to apply a logistic regression model on this score to obtain an estimate of the probability of mortality for the patient; owing to their simplicity, these methods are widely used by clinicians. However, existing ICU score systems are built from a fixed set of patient data, and often perform poorly when applied to a patient population with different characteristics; also, with changes in patient characteristics, a score system built from a given patient data set becomes suboptimal over time. Moreover, most of these score systems are built using semi-automated procedures that require some amount of manual intervention, making it difficult to adapt them to a new patient population. Thus there is a great need for adaptive methods that can automatically learn predictive models from a given set of patient data, tailored to perform well on similar patient populations. Indeed, there has been much work in recent years on applying various machine learning methods to this problem; however, these methods learn representations different from the score systems preferred by clinicians. In this work, we develop a machine learning method based on orthogonal matching pursuit (OMP) that automatically learns a score-system-type model, which enjoys the benefits of both worlds: like other machine learning methods, it is adaptive; like standard score systems, it uses a representation that is easy for clinicians to understand.
Experiments on real-world patient data sets show that our method outperforms standard ICU score systems, and performs at least as well as other machine learning methods that employ more complex representations. As an added advantage of using the OMP approach, one can use a group-sparse variant of OMP which allows learning models with similar performance using a smaller number of clinical observations; we include experiments with this as well.
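For readers unfamiliar with orthogonal matching pursuit, here is a generic OMP sketch: greedily select features one at a time, refitting least squares on the current support at each step. This is the standard algorithm in textbook form, not the paper's exact score-learning procedure or its group-sparse variant.

```python
import numpy as np

def omp(X, y, k):
    """Orthogonal matching pursuit: select k columns of X greedily.

    At each step, pick the column most correlated with the current
    residual, refit least squares on the selected support, and update
    the residual. Already-selected columns are never re-picked because
    the residual is orthogonal to their span.
    """
    n, d = X.shape
    support = []
    residual = y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(X.T @ residual)))
        support.append(j)
        beta, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ beta
    coef = np.zeros(d)
    coef[support] = beta
    return coef
```

The resulting model uses only k clinical observations, which is what makes the learned score easy for clinicians to inspect.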
We study the K-armed contextual dueling bandit problem, a sequential decision-making setting in which the learner uses contextual information to make two decisions, but only observes preference-based feedback suggesting that one decision was better than the other. We focus on the regret minimization problem under realizability, where the feedback is generated by a pairwise preference matrix that is well-specified by a given function class F. We provide a new algorithm that achieves the optimal regret rate for a new notion of best-response regret, which is a strictly stronger performance measure than those considered in prior works. The algorithm is also computationally efficient, running in polynomial time assuming access to an online oracle for square loss regression over F. This resolves an open problem on oracle-efficient, regret-optimal algorithms for contextual dueling bandits.
We consider the problem of optimal recovery of the true ranking of n items from a randomly chosen subset of their pairwise preferences. It is well known that without any further assumption, one requires a sample size of Ω(n^2) for the purpose. We analyze the problem with the additional structure of a relational graph G([n], E) over the n items, together with a locality assumption: neighboring items are similar in their rankings. Noting the preferential nature of the data, we choose to embed not the graph but its strong product, in order to capture the pairwise node relationships. Furthermore, unlike existing literature that uses Laplacian embeddings for graph-based learning problems, we use a richer class of graph embeddings, orthonormal representations, which includes the (normalized) Laplacian as a special case. Our proposed algorithm, Pref-Rank, predicts the underlying ranking using an SVM-based approach over the chosen embedding of the product graph, and is the first to provide statistical consistency on two ranking losses, Kendall's tau and Spearman's footrule, with a required sample complexity of O((n^2 χ(Ḡ))^(2/3)) pairs, χ(Ḡ) being the chromatic number of the complement graph Ḡ. Clearly, our sample complexity is smaller for dense graphs, with χ(Ḡ) characterizing the degree of node connectivity, which is also intuitive given the locality assumption, e.g. O(n^(4/3)) for a union of k-cliques, or O(n^(5/3)) for random and power-law graphs, etc.; these quantities are much smaller than the fundamental limit of Ω(n^2) for large n. This, for the first time, relates ranking complexity to structural properties of the graph. We also report experimental evaluations on different synthetic and real datasets, where our algorithm is shown to outperform the state-of-the-art methods.
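The strong-product construction can be made concrete. For adjacency matrices A and B, the strong product G1 ⊠ G2 satisfies the standard Kronecker identity A_⊠ = A⊗B + A⊗I + I⊗B, which the sketch below implements; this is the generic graph-theoretic construction, not the Pref-Rank code itself.

```python
import numpy as np

def strong_product_adjacency(A, B):
    """Adjacency matrix of the strong product G1 ⊠ G2.

    (u1, u2) ~ (v1, v2) iff each coordinate is equal or adjacent
    (and the pairs differ), giving A_⊠ = A⊗B + A⊗I + I⊗B.
    """
    Ia, Ib = np.eye(len(A)), np.eye(len(B))
    return np.kron(A, B) + np.kron(A, Ib) + np.kron(Ia, B)

# Toy check on two single-edge graphs (paths P2):
P2 = np.array([[0, 1], [1, 0]])
S = strong_product_adjacency(P2, P2)
# P2 ⊠ P2 is the complete graph K4: every off-diagonal entry is 1.
```

Any embedding of the product graph (Laplacian or, as in the abstract, an orthonormal representation) is then computed from this n²×n² adjacency, whose vertices are item pairs.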
We consider the problem of ranking a set of items from pairwise comparisons in the presence of features associated with the items. Recent works have established that O(n log(n)) samples are needed to rank well when no feature information is present. However, this might be sub-optimal in the presence of associated features. We introduce a new probabilistic preference model called the feature-Bradley-Terry-Luce (f-BTL) model, which generalizes the standard BTL model to incorporate feature information. We present a new least-squares-based algorithm called fBTL-LS, which we show requires far fewer than O(n log(n)) pairs to obtain a good ranking; precisely, our new sample complexity bound is O(α log α), where α denotes the number of 'independent items' of the set, and in general α ≪ n. Our analysis is novel and makes use of tools from classical graph matching theory to provide tighter bounds that shed light on the true complexity of the ranking problem, capturing the item dependencies in terms of their feature representations; this was not possible with the earlier matrix-completion-based tools used for this problem. We also prove an information-theoretic lower bound on the sample complexity required for recovering the underlying ranking, which essentially shows the tightness of our proposed algorithms. The efficacy of our proposed algorithms is validated through extensive experimental evaluations on a variety of synthetic and real-world datasets.
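A minimal sketch of how feature-dependent BTL comparisons can be simulated: positive BTL scores are tied to item features (here via an exponentiated linear map, an illustrative parameterization; the f-BTL model's exact form may differ), and each comparison is a Bernoulli draw with the classic BTL probability.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 3
X = rng.normal(size=(n, d))            # item features
theta = rng.normal(size=d)             # hypothetical model parameter
w = np.exp(X @ theta)                  # positive BTL scores determined by features

def compare(i, j):
    """BTL pairwise comparison: P(i beats j) = w_i / (w_i + w_j)."""
    return rng.binomial(1, w[i] / (w[i] + w[j]))

# The ground-truth ranking is the ordering of the feature-determined scores.
ranking = np.argsort(-w)
```

Because the scores live in a d-dimensional feature space rather than being n free parameters, far fewer comparisons can pin down the ranking, which is the intuition behind α ≪ n.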
We consider combinatorial online learning with subset choices when only relative feedback information from subsets is available, instead of bandit or semi-bandit feedback, which is absolute. Specifically, we study two regret minimisation problems over subsets of a finite ground set [n], with subset-wise relative preference information feedback according to the multinomial logit (MNL) choice model. In the first setting, the learner can play subsets of size bounded by a maximum size and receives top-m rank-ordered feedback, while in the second setting the learner can play subsets of a fixed size k with a full subset ranking observed as feedback. For both settings, we devise instance-dependent and order-optimal regret algorithms with regret O((n/m) ln T) and O((n/k) ln T), respectively. We derive fundamental limits on the regret performance of online learning with subset-wise preferences, proving the tightness of our regret guarantees. Our results also show the value of eliciting more general top-m rank-ordered feedback over single-winner feedback (m = 1). Our theoretical results are corroborated with empirical evaluations.
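Top-m rank-ordered feedback under the MNL model can be sampled sequentially: repeatedly draw a winner from the remaining items with probability proportional to its exponentiated utility, then remove it (Plackett-Luce sampling). The utilities theta below are an illustrative instance; the sampling scheme itself is the standard one for this feedback model.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
theta = rng.normal(size=n)                 # hypothetical MNL item utilities

def top_m_feedback(S, m):
    """Top-m rank-ordered feedback from subset S under the MNL model.

    Each position is filled by drawing a winner from the remaining
    items with probability exp(theta_i) / sum_j exp(theta_j), then
    removing it -- i.e. sequential Plackett-Luce sampling.
    """
    S, order = list(S), []
    for _ in range(m):
        w = np.exp(theta[S])
        i = rng.choice(len(S), p=w / w.sum())
        order.append(S.pop(i))
    return order

observed = top_m_feedback([0, 2, 5, 7, 9], 3)
```

Setting m = 1 recovers single-winner feedback, and m = |S| gives the full subset ranking of the second setting.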