Abstract: Automatically recycling (intermediate) results is a grand challenge for state-of-the-art databases to improve both query response time and throughput. Tuples are loaded and streamed through a tuple-at-a-time processing pipeline, avoiding materialization of intermediates as much as possible. This limits the opportunities for reuse of overlapping computations to DBA-defined materialized views and function/result cache tuning. In contrast, the operator-at-a-time execution paradigm produces fully materialized resul…
“…Such techniques help avoid recomputing identical queries, but cannot be predicted by the optimizer nor used in a canonical way. Another option, implemented in MonetDB, is to recycle intermediate results and share them with subsequent queries [16]. Recent research on work sharing, however, offers ad-hoc "collaboration" among concurrent queries, minimizing the overall work done and the number of data accesses.…”
As data analytics is used by an increasing number of applications, data analytics engines are required to execute workloads with increased concurrency, i.e., an increasing number of clients submitting queries. Data management systems designed for data analytics (a market dominated by column-stores), however, were initially optimized for single-query execution, minimizing its response time. Hence, they do not treat concurrency as a first-class citizen. In this paper, we experiment with one open-source and two commercial column-stores using the TPC-H and SSB benchmarks in a setup with an increasing number of concurrent clients submitting queries, focusing on whether the tested systems can scale up in a single-node instance. For in-memory workloads, the tested systems scale up to some degree; however, when the server is saturated they fail to fully exploit the available parallelism. Further, we highlight the unpredictable response times under high concurrency.
“…The intermediate results of the subqueries are shipped to the master server (10). Finally, the master wraps up the query execution and sends the results to the user (11).…”
Section: Architecture (mentioning)
confidence: 99%
“…Recycler. A crucial component of the Octopus architecture is the MonetDB Recycler [11]. It extends the MonetDB execution model with the capability to store and reuse intermediate results in query loads with overlapping computations.…”
Section: Architecture (mentioning)
confidence: 99%
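The recycling idea in the snippet above can be pictured as a cache keyed by an operator invocation and its arguments: a repeated invocation returns the stored intermediate instead of recomputing it. The following is a minimal sketch, not the actual MonetDB Recycler; the class, method names, and toy selection operator are all illustrative assumptions.

```python
class Recycler:
    """Toy intermediate-result cache keyed by (operator, arguments)."""

    def __init__(self):
        self.cache = {}   # (op_name, args) -> materialized intermediate
        self.hits = 0

    def execute(self, op_name, args, compute):
        key = (op_name, args)
        if key in self.cache:        # overlapping computation: reuse it
            self.hits += 1
            return self.cache[key]
        result = compute(*args)      # cache miss: compute and materialize
        self.cache[key] = result
        return result


recycler = Recycler()
# Two queries sharing a range selection over the same column:
sel = lambda lo, hi: [v for v in range(100) if lo <= v < hi]
a = recycler.execute("select", (10, 20), sel)
b = recycler.execute("select", (10, 20), sel)  # served from the cache
```

In this sketch the second call does no work: `recycler.hits` is 1 and `a` and `b` are the same materialized list. A real recycler must additionally bound the cache and invalidate entries on updates, which the toy version omits.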
“…If the set intersection of the subplans of the arguments is not empty, meaning that they all belong to at least one common subplan, the instruction is assigned to the same subplan(s) (lines 11-12). Following this general rule, the data access instructions to small query tables are replicated to all subplans.…”
Section: Distributed Plan Generation (mentioning)
confidence: 99%
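The assignment rule quoted above can be sketched in a few lines of Python. The function name is an assumption, and the replicate-when-no-common-subplan fallback is one possible interpretation of the replication rule for small-table accesses described in the snippet.

```python
def assign_subplans(arg_subplans, all_subplans):
    """Assign an instruction to subplans, given one set per argument
    listing the subplans that argument already belongs to."""
    common = set.intersection(*arg_subplans)
    if common:                     # all arguments share a subplan: join it
        return common
    # Fallback (illustrative): replicate to every subplan, as the
    # snippet describes for data accesses to small query tables.
    return set(all_subplans)


# An instruction whose two arguments live in subplans {1,2} and {2,3}
# is assigned to their common subplan {2}:
print(assign_subplans([{1, 2}, {2, 3}], [1, 2, 3]))
```

When the arguments have no subplan in common, the sketch replicates the instruction to all subplans; the paper's actual handling of that case may differ.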
“…It creates distributed execution plans and delegates subquery execution to available worker nodes, referred to as octopus tentacles. Data are shipped just-in-time (JIT) to the workers and kept in their caches using the recycler mechanism [11]. The run-time scheduler allocates subqueries on tentacles based on up-to-date status information.…”
Abstract. Distributed processing commonly requires data spread across machines using a priori static or hash-based data allocation. In this paper, we explore an alternative approach that starts from a master node in control of the complete database and a variable number of worker nodes for delegated query processing. Data is shipped just-in-time to the worker nodes using a need-to-know policy and is reused, if possible, in subsequent queries. A bidding mechanism among the workers yields a schedule with the most efficient reuse of previously shipped data, minimizing data transfer costs. Just-in-time data shipment allows our system to benefit from locally available idle resources to boost overall performance. The system is maintenance-free and allocation is fully transparent to users. Our experiments show that the proposed adaptive distributed architecture is a viable and flexible alternative for small-scale MapReduce-type settings.
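The bidding mechanism from the abstract can be sketched as follows. The worker names, data identifiers, and cost units are hypothetical; the idea is only that each worker bids the transfer cost of the data it would still need shipped, so the subquery goes to the worker that can reuse the most previously shipped data.

```python
def schedule(subquery_data, workers):
    """Pick the worker with the cheapest bid.

    subquery_data: dict mapping a data item (e.g. a column) to its
                   shipping cost; workers: dict mapping worker name to
                   the set of items already cached on that worker."""
    def bid(cached):
        # A worker's bid = total cost of the items it still lacks.
        return sum(cost for item, cost in subquery_data.items()
                   if item not in cached)
    return min(workers, key=lambda w: bid(workers[w]))


workers = {"tentacle1": {"lineitem.l_qty"}, "tentacle2": set()}
need = {"lineitem.l_qty": 600, "orders.o_date": 150}
# tentacle1 already caches the expensive column, so it bids only 150
# while tentacle2 bids 750; tentacle1 wins the subquery.
print(schedule(need, workers))
```

This greedy per-subquery choice minimizes transfer cost for one subquery at a time; the paper's scheduler additionally folds in up-to-date worker status, which the sketch ignores.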