Cleaning uncertain data with quality guarantees

Cheng, Reynold; Chen, Jinchuan; Xie, Xike

doi:10.14778/1453856.1453935

Cited by 78 publications

(59 citation statements)

References 35 publications

(80 reference statements)


“…e.g. data collected from sensor networks [15], information extraction from the web [16,17], data integration [18,19], data cleaning [20][21][22][23][24][25], social networks [26,27], radio frequency identification RFID [7]. Due to various reasons that differ from one application to another, the uncertainty is inherent in such applications.…”

Section: Related Work (mentioning)

confidence: 99%

Framework for Managing Uncertain Distributed Categorical Data

Benaissa, Yahmi, Jamil

2017

ijacsa


Abstract: In recent years, data has become increasingly uncertain, as advanced technologies continuously produce large amounts of incomplete data. Many modern applications in which uncertainty occurs are distributed in nature, e.g., distributed sensor networks, information extraction, data integration, and social networks. Consequently, even though data uncertainty has been studied for centralized settings, managing uncertainty over data in situ remains challenging. In this paper, we propose a framework for managing uncertain categorical data in distributed environments. It is built upon a hierarchical indexing technique based on an inverted index and a distributed algorithm that efficiently processes queries on uncertain data. Leveraging this indexing technique, we address two kinds of queries on distributed uncertain databases: 1) a distributed probabilistic threshold query, whose answers satisfy the probabilistic threshold requirement; and 2) a distributed top-k query, which optimizes the transfer of tuples from the distributed sources to the coordinator site as well as the processing time. Extensive experiments verify the effectiveness and efficiency of the proposed method in terms of communication cost and response time.
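The two query types named in this abstract can be illustrated with a minimal, single-machine sketch. It is not the paper's distributed algorithm or its hierarchical index: the data layout (each tuple as a category-to-probability mapping), the function names, and the sample `readings` are all assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(data):
    """Map each category to its (tuple_id, probability) postings, highest probability first."""
    index = defaultdict(list)
    for tid, dist in data.items():
        for category, prob in dist.items():
            index[category].append((tid, prob))
    for postings in index.values():
        postings.sort(key=lambda p: (-p[1], p[0]))
    return dict(index)

def threshold_query(index, value, tau):
    """Return tuples whose probability of taking `value` is at least tau."""
    result = []
    for tid, prob in index.get(value, []):
        if prob < tau:
            break  # postings are sorted, so every later entry is below tau
        result.append((tid, prob))
    return result

def top_k_query(index, value, k):
    """Return the k tuples most likely to take `value`."""
    return index.get(value, [])[:k]

# Hypothetical uncertain categorical data: tuple id -> distribution over categories.
readings = {
    "t1": {"hot": 0.7, "warm": 0.3},
    "t2": {"hot": 0.2, "cold": 0.8},
    "t3": {"warm": 0.5, "hot": 0.5},
}
index = build_inverted_index(readings)
print(threshold_query(index, "hot", 0.5))
print(top_k_query(index, "hot", 1))
```

Sorting each posting list by descending probability lets the threshold query stop at the first entry below the threshold, and the top-k query becomes a prefix of the list.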



“…The probabilistic models of the uncertain data fall into two categories: one is the possible world model [13,14] [20], and the other is the probability function model [21], in which the existence of a record is represented by a probability density function. Till now, the research for query processing over uncertain data mainly focuses on nearest neighbor (NN) problem [21], K-nearest neighbor (K-NN) problem [22], join operation [23], ranking operation [20], and top-K queries [24].…”

Section: Related Work (mentioning)

confidence: 99%

“…According to the possible world model [13][14][15], the skyline over uncertain data is not determinate, and any possible skyline has an existence probability. In such a case, people may ask that "what are the skylines of the data with existence probabilities greater than a given constant p?"…”

Section: Introduction (mentioning)

confidence: 99%

Efficient Pr-Skyline Query Processing and Optimization in Wireless Sensor Networks

Xiong

2010

WSN


As one of the most commonly used queries in modern databases, the skyline query has received extensive attention from the database research community. The uncertainty of the data in wireless sensor networks makes the corresponding skyline uncertain and not unique. This paper investigates the Pr-Skyline problem, i.e., how to compute the skyline with the highest existence probability in a computationally efficient and energy-efficient way. We formulate the problem and prove that it is NP-Complete and cannot be approximated in a given expression. However, the proposed algorithm SKY-SEARCH, with its pruning techniques, remains computationally efficient for relatively large inputs, while the filter-based distributed optimization strategy significantly reduces the transmission cost and the storage space required on the sensor nodes. Extensive experiments verify the efficiency and scalability of SKY-SEARCH and the distributed optimization strategy.
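To make the notion of a skyline's existence probability concrete, here is a brute-force sketch under the possible world model. It is not SKY-SEARCH: it enumerates every possible world, so it is exponential in the number of tuples and only suitable for tiny inputs. It assumes tuples exist independently, that smaller values are better in every dimension, and that points are distinct; the function names and sample points are hypothetical.

```python
from itertools import product

def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Return the points not dominated by any other point in the list."""
    return frozenset(p for p in points
                     if not any(dominates(q, p) for q in points if q != p))

def skyline_probabilities(uncertain):
    """uncertain: list of (point, existence_probability) pairs.

    Returns a dict mapping each possible skyline to its existence probability,
    summed over all possible worlds that produce it.
    """
    result = {}
    for mask in product([False, True], repeat=len(uncertain)):
        prob = 1.0
        world = []
        for present, (point, p) in zip(mask, uncertain):
            prob *= p if present else (1.0 - p)
            if present:
                world.append(point)
        sky = skyline(world)
        result[sky] = result.get(sky, 0.0) + prob
    return result

# Hypothetical sensor readings, each existing with probability 0.5.
points = [((1, 4), 0.5), ((2, 2), 0.5), ((3, 3), 0.5)]
probs = skyline_probabilities(points)

# Skylines whose existence probability exceeds a given constant p = 0.2.
likely = {s: pr for s, pr in probs.items() if pr > 0.2}
print(likely)
```

In this example (2, 2) dominates (3, 3), so the worlds that contain both collapse onto skylines without (3, 3), which is why some skylines accumulate more probability than others.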


“…The mobile trajectory data precision is not high, which is unable to accurately describe the user's position every time when the mobile phone accesses the base station. The literature (Cheng, Chen and Xie, 2008) puts forward the precision evaluation method based on the semantic data, and gives a data cleaning solution. But its work is not for trajectory data.…”

Section: Introduction (mentioning)

confidence: 99%

Position Mining Algorithm Based on Massive Trajectory Data

2016

Rev. Téc. Ing. Univ. Zulia


An important position is one of the main sites of activity in a person's daily life, such as the residence or the workplace. The log that mobile phones generate automatically through access to the base station is an important data source for mining user behavior patterns, for example, for identifying important positions. However, this work faces many challenges, including the massive volume of the trajectory data, the low precision of the positions, and the diversity of mobile phone users. For this purpose, a general solution framework is put forward to improve the availability of the trajectory data. The framework includes a state-based filter module, which improves the availability of the data, and an important position mining module. Based on this framework, two distributed mining algorithms, FQM and AXM, have been designed, and optimization is carried out in three respects: (1) fusion of multivariate data is used to improve the precision of the results; (2) an identification algorithm for the non-working population is put forward; (3) an identification algorithm for the night-working population is put forward. The theoretical analysis and experimental results show that the proposed algorithms have relatively high execution efficiency and scalability, with higher precision.

