Stream data are often dirty, for example, owing to unreliable sensor reading or erroneous extraction of stock prices. Most stream data cleaning approaches employ a smoothing filter, which may seriously alter the data without preserving the original information. We argue that the cleaning should avoid changing those originally correct/clean data, a.k.a. the minimum modification rule in data cleaning. To capture the knowledge about what is clean , we consider the (widely existing) constraints on the speed and acceleration of data changes, such as fuel consumption per hour, daily limit of stock prices, or the top speed and acceleration of a car. Guided by these semantic constraints, in this article, we propose the constraint-based approach for cleaning stream data. It is notable that existing data repair techniques clean (a sequence of) data as a whole and fail to support stream computation. To this end, we have to relax the global optimum over the entire sequence to the local optimum in a window. Rather than the commonly observed NP-hardness of general data repairing problems, our major contributions include (1) polynomial time algorithm for global optimum, (2) linear time algorithm towards local optimum under an efficient median-based solution , and (3) experiments on real datasets demonstrate that our method can show significantly lower L1 error than the existing approaches such as smoother.
Incomplete information often occur along with many database applications, e.g., in data integration, data cleaning or data exchange. The idea of data imputation is to fill the missing data with the values of its neighbors who share the same information. Such neighbors could either be identified certainly by editing rules or statistically by relational dependency networks. Unfortunately, owing to data sparsity, the number of neighbors (identified w.r.t. value equality) is rather limited, especially in the presence of data values with variances. In this paper, we argue to extensively enrich similarity neighbors by similarity rules with tolerance to small variations. More fillings can thus be acquired that the aforesaid equality neighbors fail to reveal. To fill the missing values more, we study the problem of maximizing the missing data imputation. Our major contributions include (1) the np-hardness analysis on solving and approximating the problem, (2) exact algorithms for tackling the problem, and (3) efficient approximation with performance guarantees. Experiments on real and synthetic data sets demonstrate that the filling accuracy can be improved.
Missing numerical values are prevalent, e.g., owing to unreliable sensor reading, collection and transmission among heterogeneous sources. Unlike categorized data imputation over a limited domain, the numerical values suffer from two issues:(1) sparsity problem, the incomplete tuple may not have sufficient complete neighbors sharing the same/similar values for imputation, owing to the (almost) infinite domain; (2) heterogeneity problem, different tuples may not fit the same (regression) model. In this study, enlightened by the conditional dependencies that hold conditionally over certain tuples rather than the whole relation, we propose to learn a regression model individually for each complete tuple together with its neighbors. Our IIM, Imputation via Individual Models, thus no longer relies on sharing similar values among the k complete neighbors for imputation, but utilizes their regression results by the aforesaid learned individual (not necessary the same) models. Remarkably, we show that some existing methods are indeed special cases of our IIM, under the extreme settings of the number ℓ of learning neighbors considered in individual learning. In this sense, a proper number ℓ of neighbors is essential to learn the individual models (avoid over-fitting or under-fitting). We propose to adaptively learn individual models over various number ℓ of neighbors for different complete tuples. By devising efficient incremental computation, the time complexity of learning a model reduces from linear to constant. Experiments on real data demonstrate that our IIM with adaptive learning achieves higher imputation accuracy than the existing approaches.
Errors are prevalent in time series data, such as GPS trajectories or sensor readings. Existing methods focus more on anomaly detection but not on repairing the detected anomalies. By simply filtering out the dirty data via anomaly detection, applications could still be unreliable over the incomplete time series. Instead of simply discarding anomalies, we propose to (iteratively) repair them in time series data, by creatively bonding the beauty of temporal nature in anomaly detection with the widely considered minimum change principle in data repairing. Our major contributions include: (1) a novel framework of iterative minimum repairing (IMR) over time series data, (2) explicit analysis on convergence of the proposed iterative minimum repairing, and (3) efficient estimation of parameters in each iteration. Remarkably, with incremental computation, we reduce the complexity of parameter estimation from O ( n ) to O (1). Experiments on real datasets demonstrate the superiority of our proposal compared to the state-of-the-art approaches. In particular, we show that (the proposed) repairing indeed improves the time series classification application.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.