Crowdsourcing opens the door to solving a wide variety of machine learning problems that were previously infeasible, allowing us to obtain relatively low-cost labeled data in a short amount of time. However, because the quality of labelers is uncertain, the resulting data are sometimes unreliable, forcing practitioners to collect information redundantly, which poses new challenges in the field. Despite these difficulties, many applications of machine learning using crowdsourced data have recently been published that achieve state-of-the-art results on relevant problems. We have analyzed these applications following a systematic methodology, classifying them into different fields of study, highlighting several of their characteristics, and showing the recent interest in the use of crowdsourcing for machine learning. We also identify several exciting research lines, based on the problems that remain unsolved, to foster future research in this field.
The goal of the Label Ranking (LR) problem is to learn preference models that predict the preferred ranking of class labels for a given unlabeled instance. Different well-known machine learning algorithms have been adapted to deal with the LR problem. In particular, fine-tuned instance-based algorithms (e.g., k-nearest neighbors) and model-based algorithms (e.g., decision trees) have performed remarkably well in tackling the LR problem. Probabilistic Graphical Models (PGMs, e.g., Bayesian networks) have not been considered for this problem because of the difficulty of modeling permutations in that framework. In this paper, we propose a Hidden Naive Bayes classifier (HNB) to cope with the LR problem. By introducing a hidden variable, we can design a hybrid Bayesian network in which several types of distributions can be combined: multinomial for discrete variables, Gaussian for numerical variables, and Mallows for permutations. We consider two kinds of probabilistic models: one based on a Naive Bayes graphical structure (where only univariate probability distributions are estimated for each state of the hidden variable) and another where we allow interactions among the predictive attributes (using a multivariate Gaussian distribution for the parameter estimation). The experimental evaluation shows that our proposals are competitive with the state-of-the-art algorithms in both accuracy and CPU time requirements.
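To make the role of the Mallows distribution over permutations more concrete, the following is a minimal sketch of how such a distribution assigns probability to a ranking. It is an illustration under stated assumptions, not the method of the paper: the Kendall tau distance is assumed as the distance between rankings (a common choice for Mallows models, but not stated in the abstract), and the function names, the central ranking `center`, and the spread parameter `theta` are hypothetical names introduced here.

```python
import math


def kendall_tau_distance(ranking_a, ranking_b):
    """Count the label pairs that the two rankings order differently."""
    position_in_b = {label: i for i, label in enumerate(ranking_b)}
    seq = [position_in_b[label] for label in ranking_a]
    n = len(seq)
    return sum(1 for i in range(n) for j in range(i + 1, n) if seq[i] > seq[j])


def mallows_normalizer(n, theta):
    """Closed-form normalizing constant for Kendall tau distance (assumed)."""
    if theta == 0:
        # With zero spread parameter every permutation is equally likely.
        return float(math.factorial(n))
    z = 1.0
    for j in range(1, n + 1):
        z *= (1.0 - math.exp(-j * theta)) / (1.0 - math.exp(-theta))
    return z


def mallows_probability(ranking, center, theta):
    """P(ranking) proportional to exp(-theta * d(ranking, center)), theta >= 0."""
    distance = kendall_tau_distance(ranking, center)
    return math.exp(-theta * distance) / mallows_normalizer(len(center), theta)
```

Under these assumptions, rankings closer to `center` receive higher probability, and a larger `theta` concentrates more mass on `center`. In a hidden-variable model of the kind described above, one such distribution could plausibly be attached to each state of the hidden variable, but how the paper estimates and combines them is not specified in the abstract.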