Stream Data Cleaning under Speed and Acceleration Constraints

Song, Shaoxu; Gao, Fei; Zhang, Aoqian; Wang, Jianmin; Yu, Philip S.

doi:10.1145/3465740

Cited by 28 publications

(51 citation statements)

References 31 publications


“…DEFINITION Note that, instead of setting ϵ_Y.min to 0 in [19,41], in this paper, we relax this limitation to let the ϵ_Y.min be any non-negative value less than ϵ_Y.max (i.e., 0 ≤ ϵ_Y.min < ϵ_Y.max), such that the CDD rule can have tighter intervals for distance constraints. CDD Rule Detection: We assume that a static data repository R is available, which can be collected/inferred by historical stream data [23,37,38,44]. Following the literature [19,41], to infer a CDD rule in the form X → A_j from R, we first obtain determinant attributes X from (d-1) attributes (other than A_j), where attributes X are correlated with A_j in R.…”

Section: Imputation Over Incomplete Data Stream (mentioning)

confidence: 99%

“…In this paper, we consider the missing at random (MAR) model [15] for incomplete data. Under the MAR model, we can classify the existing imputation methods of incomplete data into categories such as statistical-based [23], rule-based [12], constraint-based [38,44], and pattern-based [22] imputation methods. Due to textual property and sparseness of ER data sets, these works may fail to impute incomplete data, when there are only a few or even no samples for imputing missing attributes.…”

Section: Related Work (mentioning)

confidence: 99%


Online Topic-Aware Entity Resolution Over Incomplete Data Streams

Ren, Lian, Ghazinour. 2021. Proceedings of the 2021 International Conference on Management of Data


In many real applications such as the data integration, social network analysis, and the Semantic Web, the entity resolution (ER) is an important and fundamental problem, which identifies and links the same real-world entities from various data sources. While prior works usually consider ER over static and complete data, in practice, application data are usually collected in a streaming fashion, and often incur missing attributes (due to the inaccuracy of data extraction techniques). Therefore, in this paper, we will formulate and tackle a novel problem, topic-aware entity resolution over incomplete data streams (TER-iDS), which online imputes incomplete tuples and detects pairs of topic-related matching entities from incomplete data streams. In order to effectively and efficiently tackle the TER-iDS problem, we propose an effective imputation strategy, carefully design effective pruning strategies, as well as indexes/synopsis, and develop an efficient TER-iDS algorithm via index joins. Extensive experiments have been conducted to evaluate the effectiveness and efficiency of our proposed TER-iDS approach over real data sets.


“…As shown in Table II, the 4-Type constraints embody the dependence on attributes (columns) and entities (rows) for temporal data. T-3: SD, SC [11], T-4: Similarity Constraints T-3: Variance Constraints [12] points in sequence as the simple instance of Type-1 constraints, CFD for relational data and Physical Mechanism for industrial data are concluded as multi-sequence constraints. Constraints, such as SD, SC, and VC, formalizing the dependence of data points along the time in one sequence belongs to Type-3 constraints.…”

Section: A Constraint-based Anomaly Detection (mentioning)

confidence: 99%

“…Ihab F. Ilyas and Xu Chu give an overview of the end-to-end data cleaning process including error detection and repair methods in [10]. Both statistical-based [27], [28] and constraints-based [11], [29] cleaning are widely applied in temporal data quality improvement. [29] extends the idea of constraints from dependencies defined on relational database (e.g., FD, CFD in [30]), and proposes sequential dependencies (SD) to describe the semantics of temporal data.…”

Section: Related Work (mentioning)

confidence: 99%

“…[29] extends the idea of constraints from dependencies defined on relational database (e.g., FD, CFD in [30]), and proposes sequential dependencies (SD) to describe the semantics of temporal data. Accordingly, speed constraints are developed in sequential data and applied to time series cleaning solutions [11], [28]. Causality analysis tries to reason about the responsibility of a source in causing errors result.…”

Section: Related Work (mentioning)

confidence: 99%


Exploring Data and Knowledge combined Anomaly Explanation of Multivariate Industrial Data

Ding, Wang, et al. 2021. Preprint


The demand for high-performance anomaly detection techniques of IoT data becomes urgent, especially in industry field. The anomaly identification and explanation in time series data is one essential task in IoT data mining. Since that the existing anomaly detection techniques focus on the identification of anomalies, the explanation of anomalies is not well-solved. We address the anomaly explanation problem for multivariate IoT data and propose a 3-step self-contained method in this paper. We formalize and utilize the domain knowledge in our method, and identify the anomalies by the violation of constraints. We propose set-cover-based anomaly explanation algorithms to discover the anomaly events reflected by violation features, and further develop knowledge update algorithms to improve the original knowledge set. Experimental results on real datasets from large-scale IoT systems verify that our method computes high-quality explanation solutions of anomalies. Our work provides a guide to navigate the explicable anomaly detection in both IoT fault diagnosis and temporal data cleaning.


Skyline queries over incomplete data streams

2019


Nowadays, efficient and effective processing over massive stream data has attracted much attention from the database community, which are useful in many real applications such as sensor data monitoring, network intrusion detection, and so on. In practice, due to the malfunction of sensing devices or imperfect data collection techniques, real-world stream data may often contain missing or incomplete data attributes. In this paper, we will formalize and tackle a novel and important problem, named skyline query over incomplete data stream (Sky-iDS), which retrieves skyline objects (in the presence of missing attributes) with high confidences from incomplete data stream. In order to tackle the Sky-iDS problem, we will design efficient approaches to impute missing attributes of objects from incomplete data stream via differential dependency (DD) rules. We will propose effective pruning strategies to reduce the search space of the Sky-iDS problem, devise cost-model-based index structures to facilitate the data imputation and skyline computation at the same time, and integrate our proposed techniques into an efficient Sky-iDS query answering algorithm. Extensive experiments have been conducted to confirm the efficiency and effectiveness of our Sky-iDS processing approach over both real and synthetic data sets.
