Recursive partitioning methods are among the most popular techniques in machine learning. The paper investigates how these methods can be adapted to the bipartite ranking problem. In ranking, the goal is global: based on past data, define an order on the whole input space X, so that positive instances take up the top ranks with maximum probability. The most natural way to order all instances consists of projecting the input data x onto the real line through a real-valued scoring function s and using the natural order on R. The accuracy of the ordering induced by a candidate s is classically measured in terms of the ROC curve or the area under the ROC curve (AUC). Here we discuss the design of tree-structured scoring functions obtained by recursively maximizing the AUC criterion. The connection with recursive piecewise linear approximation of the optimal ROC curve, both in the L1 sense and in the L∞ sense, is highlighted. A novel tree-based algorithm, called TREERANK, specifically designed for learning to rank/order instances is proposed. Consistency results and generalization bounds of functional nature are established for this ranking method, when considering either the L1 or L∞ distance. Inspired by recent developments in the field of binary classification, we also describe committee-based learning procedures using TREERANK as a "base ranker", in order to overcome obvious drawbacks of such a top-down partitioning technique. Preliminary simulation results are also displayed.
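To make the AUC criterion concrete, here is a minimal sketch in Python. It is not the TREERANK algorithm: it only computes the empirical AUC of a scoring function as the fraction of (positive, negative) pairs ranked in the right order, with ties counted one half. The helper `stump_score`, its threshold, and the toy data are assumptions made for illustration.

```python
def empirical_auc(scores, labels):
    """Fraction of (positive, negative) pairs ordered correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def stump_score(x, threshold):
    """Hypothetical two-leaf scoring function: piecewise constant in x."""
    return 1.0 if x > threshold else 0.0


# Toy one-dimensional sample (illustrative values, not from the paper).
xs = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
ys = [0, 0, 1, 1, 0, 1]

raw_auc = empirical_auc(xs, ys)                                   # score s(x) = x
stump_auc = empirical_auc([stump_score(x, 0.5) for x in xs], ys)  # two-leaf tree
```

A piecewise constant score produces many ties, which is why the two-leaf scoring function has a lower AUC here than the raw score: splitting a leaf further can only break ties, so refining the partition is the natural way to raise the criterion.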
Extremes play a special role in Anomaly Detection. Beyond inference and simulation purposes, probabilistic tools borrowed from Extreme Value Theory (EVT), such as the angular measure, can also be used to design novel statistical learning methods for Anomaly Detection/ranking. This paper proposes a new algorithm based on multivariate EVT to learn how to rank observations in a high dimensional space with respect to their degree of 'abnormality'. The procedure relies on an original dimension-reduction technique in the extreme domain that possibly produces a sparse representation of multivariate extremes and gives insight into their dependence structure, escaping the curse of dimensionality. The representation output by the unsupervised methodology we propose here can be combined with any Anomaly Detection technique tailored to non-extreme data. As its cost is linear in the dimension and almost linear in the sample size (O(dn log n)), it scales to large problems. The approach in this paper is novel in that EVT has never been used in its multivariate version in the field of Anomaly Detection. Illustrative experimental results provide strong empirical evidence of the relevance of our approach.
Recursive partitioning methods are among the most popular techniques in machine learning. The purpose of this paper is to investigate how to adapt this methodology to the bipartite ranking problem. We present tree-structured algorithms designed for learning to rank instances based on classification data. The main contributions of the present work are the following: the practical implementation of the TREERANK algorithm, and well-founded solutions to the crucial issues related to the splitting rule and the choice of the "right" size for the ranking tree. From the angle embraced in this paper, splitting is viewed as a cost-sensitive classification task with data-dependent cost. Hence, up to straightforward modifications, any classification algorithm may serve as a splitting rule. Also, we propose to implement a cost-complexity pruning method after the growing stage in order to produce a "right-sized" ranking sub-tree with large AUC. In particular, performance bounds are established for pruning schemes inspired by recent work on nonparametric model selection. Finally, we propose indicators for variable importance and variable dependence, plus various simulation studies illustrating the potential of our method.
A specific bootstrap method is introduced for positive recurrent Markov chains, based on the regenerative method and the Nummelin splitting technique. This construction involves generating a sequence of approximate pseudo-renewal times for a Harris chain $X$ from data $X_1, \ldots, X_n$ and the parameters of a minorization condition satisfied by its transition probability kernel and then applying a variant of the methodology proposed by Datta and McCormick for bootstrapping additive functionals of type $n^{-1} \sum_{i=1}^{n} f(X_i)$ when the chain possesses an atom. This novel methodology mainly consists in dividing the sample path of the chain into data blocks corresponding to the successive visits to the atom and resampling the blocks until the (random) length of the reconstructed trajectory is at least $n$, so as to mimic the renewal structure of the chain. In the atomic case we prove that our method inherits the accuracy of the bootstrap in the independent and identically distributed case up to $O_P(n^{-1})$ under weak conditions. In the general (not necessarily stationary) case asymptotic validity for this resampling procedure is established, provided that a consistent estimator of the transition kernel may be computed. The second-order validity is obtained in the stationary case (up to a rate close to $O_P(n^{-1})$ for regular stationary chains). A data-driven method for choosing the parameters of the minorization condition is proposed and applications to specific Markovian models are discussed.
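The block-resampling step described for the atomic case can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: the atom is taken to be a single known state, the incomplete segments before the first visit and after the last visit are discarded, and the function names and toy path are hypothetical. The Nummelin splitting step for general Harris chains is not covered.

```python
import random


def regeneration_blocks(path, atom):
    """Cut the path into blocks, each ending at a visit to the atom.

    The segment before the first visit and the one after the last visit
    are incomplete blocks and are dropped (a simplifying assumption).
    """
    hits = [i for i, x in enumerate(path) if x == atom]
    return [path[a + 1 : b + 1] for a, b in zip(hits, hits[1:])]


def regenerative_bootstrap(path, atom, rng):
    """Resample whole blocks with replacement until the length reaches len(path)."""
    blocks = regeneration_blocks(path, atom)
    if not blocks:
        raise ValueError("the path must visit the atom at least twice")
    n = len(path)
    resampled = []
    while len(resampled) < n:
        resampled.extend(rng.choice(blocks))
    return resampled


# Toy trajectory on the state space {0, 1, 2}, with state 0 playing the atom.
path = [0, 1, 2, 0, 1, 0, 2, 2, 0, 1]
blocks = regeneration_blocks(path, atom=0)
boot = regenerative_bootstrap(path, atom=0, rng=random.Random(0))

# Bootstrap version of the additive functional with f(x) = x.
boot_mean = sum(boot) / len(boot)
```

Because whole blocks are drawn, the resampled trajectory has random length of at least n and always ends at a visit to the atom, which is how the renewal structure of the chain is preserved.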
A general model is proposed for studying ranking problems. We investigate learning methods based on empirical minimization of the natural estimates of the ranking risk. The empirical estimates are of the form of a U-statistic. Inequalities from the theory of U-statistics and U-processes are used to obtain performance bounds for the empirical risk minimizers. Convex risk minimization methods are also studied to give a theoretical framework for ranking algorithms based on boosting and support vector machines. Just like in binary classification, fast rates of convergence are achieved under certain noise assumptions. General sufficient conditions are proposed in several special cases that guarantee fast rates of convergence.
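As an illustration of why the empirical ranking risk is a U-statistic, the sketch below averages a pairwise loss over all ordered pairs of distinct observations. The abstract does not spell out the loss, so the form used here (a ranking rule r(x, x') errs on a pair when (y - y') r(x, x') < 0) is an assumption, as are the function names and the toy data.

```python
def empirical_ranking_risk(rule, xs, ys):
    """Average pairwise ranking error over all ordered pairs (i, j), i != j.

    rule(x, x_prime) returns +1 if x is ranked above x_prime, else -1.
    A pair counts as an error when (y_i - y_j) * rule(x_i, x_j) < 0.
    """
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    errors = 0
    for i in range(n):
        for j in range(n):
            if i != j and (ys[i] - ys[j]) * rule(xs[i], xs[j]) < 0:
                errors += 1
    return errors / (n * (n - 1))


def rule_from_score(score):
    """Hypothetical helper: turn a scoring function into a pairwise ranking rule."""
    return lambda x, x_prime: 1 if score(x) >= score(x_prime) else -1


# Toy sample (illustrative values only).
xs = [0.2, 0.9, 0.5, 0.7]
ys = [0, 1, 1, 0]
risk = empirical_ranking_risk(rule_from_score(lambda x: x), xs, ys)
```

The estimate is an average of a kernel over pairs rather than over single observations, so its terms are not independent; this is the reason tools from the theory of U-statistics and U-processes are needed in place of standard empirical process bounds.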
The detection of negative emotions through daily activities such as handwriting is useful for promoting well-being. The spread of human-machine interfaces such as tablets makes the collection of handwriting samples easier. In this context, we present a first publicly available handwriting database which relates emotional states to handwriting, that we call EMOTHAW. This database includes samples of 129 participants whose emotional states, namely anxiety, depression and stress, are assessed by the Depression Anxiety Stress Scales (DASS) questionnaire. Seven tasks are recorded through a digitizing tablet: pentagons and house drawing, words copied in handprint, circles and clock drawing, and one sentence copied in cursive writing. Records consist of pen positions, on-paper and in-air, time stamp, pressure, pen azimuth and altitude. We report our analysis on this database. From collected data, we first compute measurements related to timing and ductus. We compute separate measurements according to the position of the writing device: on paper or in-air. We analyse and classify this set of measurements (referred to as features) using a random forest approach. Random forest is a machine learning method [2], based on an ensemble of decision trees, which includes a feature ranking process. We use this ranking process to identify the features which best reveal a targeted emotional state. We then build random forest classifiers associated with each emotional state. Our results, obtained from cross-validation experiments, show that the targeted emotional states can be identified with accuracies ranging from 60% to 71%.
Background: The Cuban HIV/AIDS epidemic has the lowest prevalence rate of the Caribbean region. The objective of this paper is to give an overview of the HIV/AIDS epidemic in Cuba and to explore the reasons for this low prevalence.