Analyses typically draw on large volumes of data. Data preprocessing, which involves detecting outliers, handling missing data, formatting, integration, and normalization, is essential for achieving accurate results. Many tools and methods are available for reducing preprocessing time, but most analysts find them difficult to use. This paper proposes a method for handling outliers and missing data, called Automated PRE-Processing for Sensor Data (APREP-S). To reduce the resources required for analysis, we combine programming by example with machine learning via Bayesian inference: human knowledge is supplied to APREP-S as an example, and an appropriate proportion is calculated by machine learning via Bayesian inference. We also use k-Shape to calculate the similarity of time-series data. In the evaluation, we use temperature and humidity sensor data and compare four methods by the sum of squared errors between the original data and each method's output: (1) APREP-S, (2) the mean of the entire data, (3) the mean of the data around the imputation target, and (4) spline interpolation. The results show that APREP-S is better suited to humidity data than to temperature data. We consider the reason to be that humidity data have more change points.
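The comparison described above can be illustrated with a minimal sketch. It covers only the two mean-based baselines and the sum-of-squared-errors score; APREP-S itself and spline interpolation are not reproduced here. The function names, the `window` parameter, and the toy temperature values are assumptions made for illustration and do not come from the paper.

```python
from statistics import mean


def impute_global_mean(series):
    """Fill every missing value (None) with the mean of all observed values."""
    fill = mean(v for v in series if v is not None)
    return [fill if v is None else v for v in series]


def impute_local_mean(series, window=2):
    """Fill each missing value with the mean of observed values within
    `window` positions on either side (hypothetical neighbourhood size)."""
    global_fill = mean(v for v in series if v is not None)
    filled = []
    for i, v in enumerate(series):
        if v is not None:
            filled.append(v)
            continue
        lo, hi = max(0, i - window), min(len(series), i + window + 1)
        neighbours = [x for x in series[lo:hi] if x is not None]
        # Fall back to the global mean if no neighbour was observed.
        filled.append(mean(neighbours) if neighbours else global_fill)
    return filled


def sum_squared_error(original, imputed):
    """Sum of squared differences between the original and imputed series."""
    return sum((o - p) ** 2 for o, p in zip(original, imputed))


# Toy data (invented): one value is masked, then restored by each baseline.
original = [20.0, 21.0, 22.0, 23.0, 24.0, 25.0]
masked = [20.0, 21.0, None, 23.0, 24.0, 25.0]

sse_global = sum_squared_error(original, impute_global_mean(masked))
sse_local = sum_squared_error(original, impute_local_mean(masked, window=1))
print(f"global mean SSE: {sse_global:.4f}, local mean SSE: {sse_local:.4f}")
```

On a steadily rising series like this one, the local mean recovers the masked value more closely than the global mean; a series with many change points would behave differently, which is the kind of contrast the evaluation examines.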
The quantity of data available for analysis, including data collected by sensors and wearable devices, has grown enormously. However, to obtain accurate analysis results, data pre-processing, such as outlier detection, handling of missing data, and reconciling data recorded by different measuring instruments in different units, is essential. Considering that pre-processing consumes 80% of analyst resources, we previously proposed a method to address this problem. That method integrates machine learning based on Bayesian inference with human knowledge by using a programming-by-example approach. However, when the model is generated at one site and updated at another, the previous method is problematic in two ways: 1) all sites have to use the same features that were defined when the model was generated, and 2) no process is available for generating new training data from features, without using inference data, when the model is updated. This prompted us to propose APREP-S, which has flexible feature processes and a process for updating the model using a clustering method. We evaluate the accuracy of the imputation and the similarity of the trends by comparing the output of APREP-S and of other existing methods with the original data. The results show that APREP-S can return the most suitable methods in terms of both accuracy and similarity.