Enriching data imputation with extensive similarity neighbors

Song, Shaoxu; Zhang, Aoqian; Chen, Lei; Wang, Jianmin

doi:10.14778/2809974.2809989

Cited by 41 publications

(23 citation statements)

References 19 publications


“…A lot of work deals with queries that are incomplete with respect to missing attribute values in the entries of the (otherwise complete) result set [2,9,14,17,19,21]. A common solution to resolve this kind of incompleteness, as well as fuzzy searches over hidden databases in general, builds on query refactoring [14,17,21] and educated guessing of values [2,9,19]. However, such approaches are not usable in our problem context, since our notion of incomplete queries relates to restrictions of the result set size.…”

Section: Related Work (mentioning)

confidence: 99%

A third-party replication service for dynamic hidden databases

Hintzen, Liesy, Zirpins. 2021. SOCA


Much data on the web is available in hidden databases. Users browse their contents by sending search queries to form-based interfaces or APIs. Yet, hidden databases just return the top-k result entries and limit the number of queries per time interval. Such access restrictions constrict those tasks that require many/specific queries or need to access many/all data entries. For a temporary solution, an unrestricted local snapshot can be created by crawling the hidden database. Yet, keeping the snapshot permanently consistent is challenging due to the access restrictions of its origin. In this paper, we propose a replication approach providing permanent unrestricted access to the local copy of a hidden database with dynamic changes. To this end, we present an algorithm to effectively crawl hidden databases that outperforms the state of the art. Furthermore, we propose a new way to continuously control the consistency of the replicated database in an efficient manner. We also introduce the cloud-based architecture of a replication service for hidden databases. We show the effectiveness of the approach through a variety of reproducible experimental evaluations.



“…The […] function determines the structural similarity among the target […] and […], the higher the numerical value is, a more closer structural description of […] instance is with […] description [31, 32]. As a result, structural attributes are suggested for a tuple […] with missing attributes.…”

Section: Methods (mentioning)

confidence: 99%

SemImput: Bridging Semantic Imputation with Deep Learning for Complex Human Activity Recognition

Razzaq, Cleland, Nugent, et al. 2020. Sensors


The recognition of activities of daily living (ADL) in smart environments is a well-known and an important research area, which presents the real-time state of humans in pervasive computing. The process of recognizing human activities generally involves deploying a set of obtrusive and unobtrusive sensors, pre-processing the raw data, and building classification models using machine learning (ML) algorithms. Integrating data from multiple sensors is a challenging task due to dynamic nature of data sources. This is further complicated due to semantic and syntactic differences in these data sources. These differences become even more complex if the data generated is imperfect, which ultimately has a direct impact on its usefulness in yielding an accurate classifier. In this study, we propose a semantic imputation framework to improve the quality of sensor data using ontology-based semantic similarity learning. This is achieved by identifying semantic correlations among sensor events through SPARQL queries, and by performing a time-series longitudinal imputation. Furthermore, we applied deep learning (DL) based artificial neural network (ANN) on public datasets to demonstrate the applicability and validity of the proposed approach. The results showed a higher accuracy with semantically imputed datasets using ANN. We also presented a detailed comparative analysis, comparing the results with the state-of-the-art from the literature. We found that our semantic imputed datasets improved the classification accuracy with 95.78% as a higher one thus proving the effectiveness and robustness of learned models.


“…Differential dependency (DD) [32] is a valuable tool for data imputation [34], data cleaning [28], data repairing [33], and so on. Song et al [34] used the DDs to fill the missing attributes of incomplete objects on static data set via some detected neighbors satisfying the distance constraints on determinant attributes. Song et al [33,36] also explored to repair labels of graph nodes.…”

Section: Related Work (mentioning)

confidence: 99%

Efficient Join Processing Over Incomplete Data Streams

Ren, Lian, Ghazinour. 2019. Proceedings of the 28th ACM International Conference on Information and Knowledge Management


For decades, the join operator over fast data streams has always drawn much attention from the database community, due to its wide spectrum of real-world applications, such as online clustering, intrusion detection, sensor data monitoring, and so on. Existing works usually assume that the underlying streams to be joined are complete (without any missing values). However, this assumption may not always hold, since objects from streams may contain some missing attributes, due to various reasons such as packet losses, network congestion/failure, and so on. In this paper, we formalize an important problem, namely join over incomplete data streams (Join-iDS), which retrieves joining object pairs from incomplete data streams with high confidences. We tackle the Join-iDS problem in the style of "data imputation and query processing at the same time". To enable this style, we design an effective and efficient cost-model-based imputation method via differential dependency (DD), devise effective pruning strategies to reduce the Join-iDS search space, and propose efficient algorithms via our proposed cost-model-based data synopsis/indexes. Extensive experiments have been conducted to verify the efficiency and effectiveness of our proposed Join-iDS approach on both real and synthetic data sets.

Figure 1 illustrates two critical routers, O and U, in an IP network, from which we collect statistical (log) attributes in a streaming manner, for example, No. of connections, the connection duration, and the transferred data size. In practice, due to packet losses, network congestion/delays, or hardware failure, we may not always obtain all attributes from each router. As an example in Table 1, the transferred data size of router o_t is missing (denoted as "-") at timestamp t. As a result, stream data collected from each router may sometimes contain incomplete attributes.

One critical, yet challenging, problem in the network is to monitor network traffic, and detect potential network intrusion. If one router (e.g., O) is under the attack of network intrusion, we should quickly identify potential attacks in other routers, like U, at close timestamps, to which we may take actions for protecting the network security. In…
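The citation statement for this entry describes filling missing attributes from neighbors that satisfy distance constraints on the determinant attributes. The following is a minimal sketch of that idea only, not the method of either paper. The function names (`dd_neighbors`, `impute`), the absolute-difference distance, the per-attribute thresholds, and the use of a mean to combine neighbor values are all assumptions made for illustration; the router attributes echo the example in the abstract above, with made-up values.

```python
from statistics import mean
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def dd_neighbors(target: Row, rows: List[Row], thresholds: Dict[str, float]) -> List[Row]:
    """Rows whose determinant attributes all lie within the given distance of target.

    Assumption: distance is the absolute difference on each numeric attribute.
    """
    return [
        r for r in rows
        if all(
            r[a] is not None and target[a] is not None and abs(r[a] - target[a]) <= eps
            for a, eps in thresholds.items()
        )
    ]


def impute(target: Row, rows: List[Row], thresholds: Dict[str, float], missing_attr: str) -> Optional[float]:
    """Suggest a value for missing_attr from qualifying neighbors, or None if there are none.

    Assumption: candidates are combined with a plain mean.
    """
    candidates = [
        r[missing_attr]
        for r in dd_neighbors(target, rows, thresholds)
        if r[missing_attr] is not None
    ]
    return mean(candidates) if candidates else None


# Hypothetical router log records (values invented for illustration).
rows = [
    {"connections": 10, "duration": 5.0, "size": 100.0},
    {"connections": 11, "duration": 5.5, "size": 120.0},
    {"connections": 50, "duration": 30.0, "size": 900.0},
]
target = {"connections": 10, "duration": 5.2, "size": None}
thresholds = {"connections": 2, "duration": 1.0}

suggested = impute(target, rows, thresholds, "size")
```

Here the first two records fall within both thresholds, so the suggested size is the mean of their sizes; a target with no qualifying neighbor yields `None` rather than a guess.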
