The aim of multi-output learning is to simultaneously predict multiple outputs given an input. It is an important learning problem for decision-making, since real-world decisions often involve multiple complex factors and criteria. Recently, a growing number of studies have focused on ways to predict multiple outputs at once, and these efforts have taken different forms according to the particular multi-output learning problem under study. Classic instances of multi-output learning include multi-label learning, multi-dimensional learning, and multi-target regression, among others. From our survey of the topic, we were struck by the lack of studies that generalize the different forms of multi-output learning into a common framework. This paper fills that gap with a comprehensive review and analysis of the multi-output learning paradigm. In particular, taking inspiration from big data, we characterize the 4 Vs of multi-output learning, i.e., volume, velocity, variety, and veracity, and the ways in which the 4 Vs both benefit and challenge multi-output learning. We analyze the life cycle of output labeling, present the main mathematical definitions of multi-output learning, and examine the field's key challenges and corresponding solutions as found in the literature. Several model evaluation metrics and popular data repositories are also discussed. Finally, we highlight some emerging challenges in multi-output learning from the perspective of the 4 Vs as potential research directions worthy of further study.
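To make the paradigm concrete, the following is a minimal sketch of multi-label learning, one of the classic forms of multi-output learning named above, using the simple binary-relevance strategy (one independent rule per label). The toy data and the nearest-centroid rule are hypothetical illustrations, not the survey's methods.

```python
# Multi-label learning sketch: each target is a binary vector, and we
# train one independent nearest-centroid rule per label (binary relevance).

def fit_binary_relevance(X, Y):
    """For each label, store the mean feature vector of the positive
    and negative examples."""
    n_labels = len(Y[0])
    models = []
    for j in range(n_labels):
        pos = [x for x, y in zip(X, Y) if y[j] == 1]
        neg = [x for x, y in zip(X, Y) if y[j] == 0]
        mean = lambda rows: [sum(col) / len(rows) for col in zip(*rows)]
        models.append((mean(pos), mean(neg)))
    return models

def predict(models, x):
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    # Predict label j as present iff x is closer to its positive centroid.
    return [1 if dist2(x, p) < dist2(x, n) else 0 for p, n in models]

X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]   # toy inputs
Y = [[1, 0], [1, 0], [0, 1], [1, 1]]                    # toy label vectors
models = fit_binary_relevance(X, Y)
print(predict(models, [0.05, 0.1]))  # → [1, 0]
```

Binary relevance ignores dependencies between outputs; much of the multi-output learning literature surveyed here is about modeling those dependencies explicitly.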
Multi-output learning, the task of simultaneously predicting multiple outputs for a single input, has attracted increasing interest from researchers due to its wide applicability. The k-nearest-neighbor (kNN) algorithm is one of the most popular frameworks for handling multi-output problems, and its performance depends crucially on the metric used to compute the distance between instances. However, our experimental results show that existing advanced metric learning techniques cannot provide an appropriate distance metric for multi-output tasks. This paper systematically studies how to learn an appropriate distance metric for multi-output problems. In particular, we present a novel large margin metric learning paradigm for multi-output tasks, which projects both the input and the output into the same embedding space and then learns a distance metric that captures output dependencies, so that instances with very different outputs are pushed far apart. Several strategies are then proposed to speed up training and testing. Moreover, we study the generalization error bound of our method and show that it tightens the excess risk bounds. Experiments on three multi-output learning tasks (multi-label classification, multi-target regression, and multi-concept retrieval) validate the effectiveness and scalability of the proposed method.
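The core object such methods learn is a linear (Mahalanobis-style) metric d_L(a, b) = ||L(a - b)||_2, which reduces to the Euclidean distance when L is the identity. The sketch below only illustrates how a projection matrix reshapes distances; the matrix L is hand-picked for illustration, not learned by the paper's large margin procedure.

```python
# Distance under a linear metric: d_L(a, b) = || L (a - b) ||_2.
# Learning L so that instances with different outputs end up far
# apart is the goal of large margin metric learning for kNN.

def d_L(a, b, L):
    diff = [ai - bi for ai, bi in zip(a, b)]
    proj = [sum(row[k] * diff[k] for k in range(len(diff))) for row in L]
    return sum(v * v for v in proj) ** 0.5

a, b = [1.0, 0.0], [0.0, 1.0]
I = [[1.0, 0.0], [0.0, 1.0]]   # identity: plain Euclidean distance
L = [[2.0, 0.0], [0.0, 0.5]]   # stretches dimension 0, shrinks dimension 1
print(d_L(a, b, I))  # sqrt(2)    ≈ 1.414
print(d_L(a, b, L))  # sqrt(4.25) ≈ 2.062
```

Under the learned metric, neighbor rankings change: dimensions that discriminate between different output vectors get stretched, so kNN retrieves neighbors with more similar outputs.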
Approximate nearest neighbor (ANN) search has achieved great success in many tasks. However, existing popular methods for ANN search, such as hashing and quantization methods, are designed for static databases only. They cannot handle databases whose data distribution evolves dynamically, due to the high computational cost of retraining the model on the new database. In this paper, we address this problem by developing an online product quantization (online PQ) model that incrementally updates the quantization codebook to accommodate incoming streaming data. Moreover, to further alleviate the large-scale computation required for the online PQ update, we design two budget constraints that allow the model to update only part of the PQ codebook instead of all of it. We derive a loss bound that guarantees the performance of our online PQ model. Furthermore, we develop an online PQ model over a sliding window, with both data insertion and deletion supported, to reflect the real-time behaviour of the data. The experiments demonstrate that our online PQ model is both time-efficient and effective for ANN search in dynamic large-scale databases compared with baseline methods, and that partial PQ codebook updates further reduce the update cost.
Index Terms: Online indexing model, product quantization, nearest neighbour search.
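A toy sketch of the underlying product quantization idea: split each vector into M subvectors, quantize each against its own small codebook, and store only the codeword indices. The online update shown here (nudging the assigned centroid toward an incoming point) is a generic streaming k-means step used purely to illustrate incremental codebook maintenance; the paper's actual update rule and budget constraints differ.

```python
# Product quantization with M = 2 subspaces over 4-dimensional vectors.

def encode(x, codebooks):
    """Return the list of nearest-centroid indices, one per subspace."""
    M = len(codebooks)
    d = len(x) // M
    code = []
    for m in range(M):
        sub = x[m * d:(m + 1) * d]
        dists = [sum((s - c) ** 2 for s, c in zip(sub, cent))
                 for cent in codebooks[m]]
        code.append(dists.index(min(dists)))
    return code

def online_update(x, codebooks, lr=0.1):
    """Nudge each assigned centroid toward the incoming point, so the
    codebook tracks a drifting data distribution without retraining."""
    d = len(x) // len(codebooks)
    for m, k in enumerate(encode(x, codebooks)):
        sub = x[m * d:(m + 1) * d]
        codebooks[m][k] = [c + lr * (s - c)
                           for c, s in zip(codebooks[m][k], sub)]

codebooks = [[[0.0, 0.0], [1.0, 1.0]],   # codebook for subvector 0
             [[0.0, 0.0], [1.0, 1.0]]]   # codebook for subvector 1
print(encode([0.1, 0.0, 0.9, 1.0], codebooks))  # → [0, 1]
online_update([0.1, 0.0, 0.9, 1.0], codebooks)
```

Storing 2 one-byte indices instead of 4 floats is the source of PQ's memory savings; the budget-constrained variant in the paper would update only a subset of the M codebooks per step.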
Conducting (big) data analytics in an organization is not just about using a processing framework (e.g., Hadoop/Spark) to learn a model from data currently in a single file system (e.g., HDFS). We frequently need to pipeline real-time data from other systems into the processing framework and continually update the learned model. The processing frameworks need to be easily invokable for different purposes to produce different models. The model and subsequent model updates need to be integrated with a product that may require real-time prediction using the latest trained model. All of these need to be shared among different teams in the organization for different data analytics purposes. In this paper, we propose a real-time data-analytics-as-service architecture that uses RESTful web services to wrap and integrate data services, dynamic model training services (supported by a big data processing framework), prediction services, and the product that uses the models. We discuss the challenges of wrapping big data processing frameworks as services, along with other architecturally significant factors that affect system reliability, real-time performance, and prediction accuracy. We evaluate our architecture using a log-driven system operation anomaly detection system in which the staleness of the data used for model training and the speed of model update and prediction are critical requirements.
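A minimal sketch of the kind of prediction service such an architecture wraps, using only Python's standard library: a hypothetical /predict endpoint that applies the latest trained model (a stub here) to a JSON payload. The endpoint name, the feature format, and the threshold rule are all illustrative assumptions; a real deployment would add model hot-swapping, authentication, and monitoring.

```python
# RESTful prediction service sketch using only the standard library.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def latest_model(features):
    """Stub standing in for the most recently trained model;
    here, a trivial threshold-based anomaly rule."""
    return {"anomaly": sum(features) > 1.0}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        # Parse the JSON body, run the model, return a JSON response.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        features = json.loads(body)["features"]
        out = json.dumps(latest_model(features)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(out)

# To serve:
#   HTTPServer(("localhost", 8080), PredictHandler).serve_forever()
```

Because the model is reached only through `latest_model`, a retraining service can replace it behind the same HTTP interface, which is the decoupling the service-wrapping architecture relies on.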