The ability to estimate resource consumption of SQL queries is crucial for a number of tasks in a database system such as admission control, query scheduling and costing during query optimization. Recent work has explored the use of statistical techniques for resource estimation in place of the manually constructed cost models used in query optimization. Such techniques, which require as training data examples of resource usage in queries, offer the promise of superior estimation accuracy since they can account for factors such as hardware characteristics of the system or bias in cardinality estimates. However, the proposed approaches lack robustness in that they do not generalize well to queries that are different from the training examples, resulting in significant estimation errors. Our approach aims to address this problem by combining knowledge of database query processing with statistical models. We model resource-usage at the level of individual operators, with different models and features for each operator type, and explicitly model the asymptotic behavior of each operator. This results in significantly better estimation accuracy and the ability to estimate resource usage of arbitrary plans, even when they are very different from the training instances. We validate our approach using various large scale real-life and benchmark workloads on Microsoft SQL Server.
We identify proximity breach as a privacy threat specific to numerical sensitive attributes in anonymized data publication. Such breach occurs when an adversary concludes with high confidence that the sensitive value of a victim individual must fall in a short interval -even though the adversary may have low confidence about the victim's actual value.None of the existing anonymization principles (e.g., kanonymity, l-diversity, etc.) can effectively prevent proximity breach. We remedy the problem by introducing a novel principle called (ε, m)-anonymity. Intuitively, the principle demands that, given a QI-group G, for every sensitive value x in G, at most 1/m of the tuples in G can have sensitive values "similar" to x, where the similarity is controlled by ε. We provide a careful analytical study of the theoretical characteristics of (ε, m)-anonymity, and the corresponding generalization algorithm. Our findings are verified by experiments with real data.
This paper deals with a new type of privacy threat, called "corruption", in anonymized data publication. Specifically, an adversary is said to have corrupted some individuals, if s/he has already obtained their sensitive values before consulting the released information. Conventional generalization may lead to severe privacy disclosure in the presence of corruption. Motivated by this, we advocate an alternative anonymization technique that integrates generalization with perturbation and stratified sampling. The integration provides strong privacy guarantees, even if an adversary has corrupted any number of individuals. We verify the effectiveness of the proposed technique through experiments with real data.
Keyword search (KWS) over relational databases has recently received significant attention. Many solutions and many prototypes have been developed. This task requires addressing many issues, including robustness, accuracy, reliability, and privacy. An emerging issue, however, appears to be performance related: current KWS systems have unpredictable running times. In particular, for certain queries it takes too long to produce answers, and for others the system may even fail to return (e.g., after exhausting memory). In this paper we argue that as today's users have been "spoiled" by the performance of Internet search engines, KWS systems should return whatever answers they can produce quickly and then provide users with options for exploring any portion of the answer space not covered by these answers. Our basic idea is to produce answers that can be generated quickly as in today's KWS systems, then to show users query forms that characterize the unexplored portion of the answer space. Combining KWS systems with forms allows us to bypass the performance problems inherent to KWS without compromising query coverage. We provide a proof of concept for this proposed approach, and discuss the challenges encountered in building this hybrid system. Finally, we present experiments over real-world datasets to demonstrate the feasibility of the proposed solution.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.