Abstract: Automatically recycling (intermediate) results is a grand challenge for state-of-the-art databases to improve both query response time and throughput. Tuples are loaded and streamed through a tuple-at-a-time processing pipeline, avoiding materialization of intermediates as much as possible. This limits the opportunities for reuse of overlapping computations to DBA-defined materialized views and function/result cache tuning. In contrast, the operator-at-a-time execution paradigm produces fully materialized results…
“…Furthermore, the HashStash optimizer supports four different cases for reuse-aware operators: exact-, subsuming-, partial-, and overlapping-reuse. This is different from the existing approaches in [15,25,18], which only support the exact-reuse and subsuming-reuse cases. The exact case enables a join or aggregation operator to reuse a cached hash table which contains exactly the tuples required by the query.…”
Section: Reuse-aware Query Optimizer (mentioning)
confidence: 68%
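To make the four reuse cases concrete, here is a minimal sketch of how a reuse-aware optimizer might classify a cached hash table against an incoming query. It is simplified to single-attribute range predicates, and all names (ReuseCase, Range, classify_reuse) are illustrative placeholders, not HashStash's actual API.

```python
# Hypothetical sketch: classifying a cached hash table against a query's
# required predicate range. Simplified to one numeric attribute.
from dataclasses import dataclass
from enum import Enum

class ReuseCase(Enum):
    EXACT = "exact-reuse"              # cache holds exactly the required tuples
    SUBSUMING = "subsuming-reuse"      # cache holds a superset; filter on probe
    PARTIAL = "partial-reuse"          # cache holds a subset; top up the rest
    OVERLAPPING = "overlapping-reuse"  # ranges intersect without containment
    NONE = "no-reuse"

@dataclass(frozen=True)
class Range:
    lo: float
    hi: float

def classify_reuse(cached: Range, required: Range) -> ReuseCase:
    if cached == required:
        return ReuseCase.EXACT
    if cached.lo <= required.lo and required.hi <= cached.hi:
        return ReuseCase.SUBSUMING
    if required.lo <= cached.lo and cached.hi <= required.hi:
        return ReuseCase.PARTIAL
    if cached.lo <= required.hi and required.lo <= cached.hi:
        return ReuseCase.OVERLAPPING
    return ReuseCase.NONE

# Example: a hash table built for sales in 2019-2020; the query asks for 2020.
print(classify_reuse(Range(2019, 2020), Range(2020, 2020)))  # ReuseCase.SUBSUMING
```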
“…Reuse of Intermediates: In order to better support user sessions in DBMSs, various techniques have been developed in the past to reuse intermediates [25,15,18]. All these techniques typically require that results of individual operators are materialized into temporary tables.…”
Section: Related Work (mentioning)
confidence: 99%
“…To that end, their cost models take into account neither the peculiarities of hash tables nor hardware-dependent parameters such as CPU caches. In [15], the authors integrate reuse techniques into MonetDB, which implements an operator-at-a-time execution model that relies on full materialization of all intermediate results anyway and thus does not need to tackle the issues that result from additional materialization costs as in pipelined databases. [18] extends the work of [15] to pipelined databases and integrates the ideas into Vectorwise.…”
Section: Related Work (mentioning)
confidence: 99%
“…Motivation: Reusing intermediates in databases to speed up analytical query processing has been studied in the past [15,25,18,13,8,20,28]. These solutions typically require intermediate results of individual operators to be materialized into temporary tables to be considered for reuse in subsequent queries.…”
Reusing intermediates in databases to speed up analytical query processing has been studied in the past. Existing solutions typically require intermediate results of individual operators to be materialized into temporary tables to be considered for reuse in subsequent queries. However, these approaches are fundamentally ill-suited for modern main memory databases. The reason is that modern main memory DBMSs are typically limited by the bandwidth of the memory bus, so query execution is heavily optimized to keep tuples in the CPU caches and registers. Adding extra materialization operations into a query plan therefore not only adds traffic to the memory bus but, more importantly, destroys cache- and register-locality opportunities, resulting in high performance penalties.

In this paper we study a novel reuse model for intermediates, which caches internal physical data structures materialized during query processing (due to pipeline breakers) and externalizes them so that they become reusable for upcoming operations. We focus on hash tables, the most commonly used internal data structure in main memory databases for join and aggregation operations. As queries arrive, our reuse-aware optimizer reasons about the reuse opportunities for hash tables, employing cost models that take into account hash table statistics together with the CPU and data movement costs within the cache hierarchy. Experimental results based on our HashStash prototype demonstrate performance gains of 2× for typical analytical workloads with no additional overhead for materializing intermediates.
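As a rough illustration of the kind of decision such a reuse-aware optimizer makes, the sketch below compares the estimated cost of building a hash table from scratch against reusing a cached one. The functions and constants (build_cost, reuse_cost, the per-tuple costs) are assumptions of our own for illustration, not the paper's actual cost model.

```python
# Hypothetical sketch of a build-vs-reuse cost comparison. All per-tuple
# cost constants are illustrative placeholders.

def build_cost(n_input: int, cost_scan: float, cost_insert: float) -> float:
    """Estimated cost of building a fresh hash table from n_input base tuples."""
    return n_input * (cost_scan + cost_insert)

def reuse_cost(n_cached: int, n_missing: int,
               cost_filter: float, cost_scan: float, cost_insert: float) -> float:
    """Estimated cost of reusing a cached hash table: filter out entries the
    query does not need (subsuming case) and insert the tuples the cache is
    missing (partial case)."""
    return n_cached * cost_filter + n_missing * (cost_scan + cost_insert)

# Reuse wins when most of the required tuples are already in the cached table.
if reuse_cost(90_000, 10_000, 0.2, 1.0, 2.0) < build_cost(100_000, 1.0, 2.0):
    print("optimizer picks the reuse-aware plan")
```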
“…Caching to recycle work. Finally, we consider previous works [6,14,18,24,31] that address the problem of reusing intermediate query results, cast as a general caching problem. Our work differs substantially from those approaches in that they mainly focus on cache eviction, using past queries to decide, in an online fashion, what to keep in memory.…”
In modern large-scale distributed systems, analytics jobs submitted by various users often share similar work, for example scanning and processing the same subset of data. Instead of optimizing jobs independently, which may result in redundant and wasteful processing, multi-query optimization techniques can be employed to save a considerable amount of cluster resources. In this work, we introduce a novel method combining in-memory cache primitives and multi-query optimization, to improve the efficiency of data-intensive, scalable computing frameworks. By careful selection and exploitation of common (sub)expressions, while satisfying memory constraints, our method transforms a batch of queries into a new, more efficient one which avoids unnecessary recomputations. To find feasible and efficient execution plans, our method uses a cost-based optimization formulation akin to the multiple-choice knapsack problem. Extensive experiments on a prototype implementation of our system show significant benefits of worksharing for both TPC-DS workloads and detailed micro-benchmarks.
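To illustrate the multiple-choice-knapsack formulation this abstract alludes to, here is a toy sketch: each common subexpression forms a group, each caching option is an item with a memory footprint (weight) and recomputation savings (value), and at most one option per group is picked under a memory budget. The mck helper and all numbers are hypothetical, not the paper's actual optimizer.

```python
# Hypothetical sketch: selecting which shared (sub)expressions to cache,
# phrased as a multiple-choice knapsack and solved by dynamic programming.

def mck(groups, budget):
    """groups: list of groups, each a list of (mem_cost, savings) options.
    A (0, 0) "don't cache" option is added to every group. Returns the
    maximum total savings achievable within the memory budget."""
    dp = [0.0] * (budget + 1)
    for options in groups:
        options = [(0, 0.0)] + options
        new_dp = [float("-inf")] * (budget + 1)
        for b in range(budget + 1):
            for mem, save in options:
                if mem <= b:
                    new_dp[b] = max(new_dp[b], dp[b - mem] + save)
        dp = new_dp
    return dp[budget]

# Two shared subexpressions under an 8 GB budget (memory in whole GB):
groups = [[(3, 40.0), (5, 55.0)],   # e.g., cache a filtered scan vs. a full join
          [(4, 30.0)]]              # e.g., cache an aggregate
print(mck(groups, 8))  # -> 70.0: the 3 GB option plus the aggregate fit; 5+4 GB would not
```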