In this paper, we present a purpose-built data management system, MLdp, for all machine learning (ML) datasets. ML applications pose unique requirements that differ from those of conventional data processing applications, including but not limited to: data lineage and provenance tracking, rich data semantics and formats, integration with diverse ML frameworks and access patterns, trial-and-error driven data exploration and evolution, rapid experimentation, reproducibility of model training, and strict compliance and privacy regulations. Current ML systems and services, often named MLaaS, have to date focused on the ML algorithms and offer no integrated data management system. Instead, they require users to bring their own data and to manage it on either blob storage or file systems. The burden of data management tasks, such as versioning and access control, falls onto the users, and not all compliance features, such as terms of use, privacy measures, and auditing, are available. MLdp offers a minimalist and flexible data model for all varieties of data, strong version management to guarantee reproducibility of ML experiments, and integration with major ML frameworks. MLdp also maintains data provenance to help users track lineage and dependencies among data versions and models in their ML pipelines. In addition to table-stakes features, such as security, availability, and scalability, MLdp's internal design choices are strongly influenced by the goal to support rapid ML experiment iterations, which
Linear models are commonly used to identify trends in data. While it is an easy task to build linear models using pre-selected variables, it is challenging to select the best variables from a large number of alternatives. Most metrics for selecting variables are global in nature, and thus not useful for identifying local patterns. In this work, we present an integrated framework with visual representations that allows the user to incrementally build and verify models in three model spaces that support local pattern discovery and summarization: model complementarity, model diversity, and model representivity. Visual representations are designed and implemented for each of the model spaces. Our visualizations enable the discovery of complementary variables, i.e., those that perform well in modeling different subsets of data points. They also support the isolation of local models based on a diversity measure. Furthermore, the system integrates a hierarchical representation to identify the outlier local trends and the local trends that share similar directions in the model space. A case study on financial risk analysis is discussed, followed by a user study.
The ultimate goal of any visual analytic task is to make sense of the data and gain insights. Unfortunately, the continuously growing scale of data challenges traditional data analytics in the "big-data" era. In particular, human cognitive capabilities are constant whereas the data scale is not. Furthermore, most existing work focuses on how to extract interesting information and present it to the user, with little emphasis on giving analysts alternatives when the extracted information is not interesting. In this paper, we propose a visual analytic tool called MaVis that integrates multiple machine learning models in a plug-and-play style to describe the input data. It allows analysts to choose the way they prefer to summarize the data. The MaVis framework provides multiple linked analytic spaces for interpretation at different levels. The low-level data space handles the data binning strategy, while the high-level model space handles model summarizations (i.e., clusters or trends). MaVis also supports model analytics that visualize the summarized patterns and compare and contrast them. This framework is shown to provide several novel methods of investigating co-movement patterns in time-series datasets, which is a common interest of medical sciences, finance, business, and engineering alike. Lastly, we demonstrate the usefulness of our framework via a case study and a user study using a stock price dataset.
We will demonstrate the visual analytics system VistreamT, which supports interactive mining of complex patterns within and across live data streams and stream pattern archives. Our system is equipped with both computational pattern mining and visualization techniques, which allow it to not only efficiently discover and manage patterns but also effectively convey the mining results to human analysts through visual displays. In our demonstration, we will illustrate that with VistreamT, analysts can easily submit, monitor, and interact with a broad range of query types for pattern mining. This includes novel strategies for extracting complex patterns from streams in real time, summarizing neighbor-based patterns using multi-resolution compression strategies, selectively pushing patterns into the stream archive, validating the popularity or rarity of stream patterns by stream archive matching, and pattern evolution tracking to link patterns across time.
A significant task within data mining is to identify data models of interest. While facilitating exploration tasks, most visualization systems do not make use of all the data models that are generated during the exploration. In this paper, we introduce a system that allows the user to gain insights from the data space progressively by forming data models and consolidating the generated models on the fly. Each model can be a computationally extracted or user-defined subset that carries a certain degree of interest and might lead to some discoveries. As the user generates more and more data models, the degree of interest of some portions of some models will either grow (indicating higher occurrence) or fluctuate or decrease (corresponding to lower occurrence). Our system maintains a collection of such models and accumulates the interestingness of each model into a consolidated model. In order to consolidate the models, the system summarizes the associations between the models in the collection and identifies support (models reinforce each other), complementarity (models complement each other), and overlap among the models. The accumulated interestingness keeps track of historical exploration and helps users summarize their findings, which can lead to new discoveries. This mechanism for integrating results from multiple models can be applied to a wide range of decision support systems. We demonstrate our system in a case study involving the financial status of US companies.
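The consolidation idea in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration, not the authors' method: each model is represented as a set of data-point identifiers paired with a single interest weight, the consolidated model is a per-point sum of those weights, and pairwise overlap is measured with the Jaccard index. The function names `consolidate` and `overlaps` and the sample model names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def consolidate(models):
    """Sum the interest weight of every model that contains each data point.

    `models` maps a model name to (set_of_point_ids, interest_weight).
    This representation is an assumption made for illustration.
    """
    consolidated = Counter()
    for points, interest in models.values():
        for point in points:
            consolidated[point] += interest
    return consolidated


def overlaps(models):
    """Jaccard overlap for each pair of models (a stand-in association measure)."""
    result = {}
    for (name_a, (pts_a, _)), (name_b, (pts_b, _)) in combinations(models.items(), 2):
        union = pts_a | pts_b
        result[(name_a, name_b)] = len(pts_a & pts_b) / len(union) if union else 0.0
    return result


# Hypothetical models over companies A..E.
models = {
    "high_debt": ({"A", "B", "C"}, 1.0),
    "low_cash": ({"B", "C", "D"}, 1.0),
    "user_pick": ({"C", "E"}, 0.5),
}

scores = consolidate(models)
pairwise = overlaps(models)
```

In this toy data, company C belongs to all three models and therefore accumulates the highest score, which mirrors the notion that repeated occurrence across models raises a point's degree of interest.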