The multi-core architectures of today's computer systems make parallelism a necessity for performance critical applications. Writing such applications in a generic, hardware-oblivious manner is a challenging problem: Current database systems thus rely on laborintensive and error-prone manual tuning to exploit the full potential of modern parallel hardware architectures like multi-core CPUs and graphics cards. We propose an alternative design for a parallel database engine, based on a single set of hardware-oblivious operators, which are compiled down to the actual hardware at runtime. This design reduces the development overhead for parallel database engines, while achieving competitive performance to hand-tuned systems.We provide a proof-of-concept for this design by integrating operators written using the parallel programming framework OpenCL into the open-source database MonetDB. Following this approach, we achieve efficient, yet highly portable parallel code without the need for optimization by hand. We evaluated our implementation against MonetDB using TPC-H derived queries and observed a performance that rivals that of MonetDB's query execution on the CPU and surpasses it on the GPU. In addition, we show that the same set of operators runs nearly unchanged on a GPU, demonstrating the feasibility of our approach.
In-memory databases require careful tuning and many engineering tricks to achieve good performance. Such database performance engineering is hard: a plethora of data and hardware-dependent optimization techniques form a design space that is difficult to navigate for a skilled engineer-even more so for a query compiler. To facilitate performanceoriented design exploration and query plan compilation, we present Voodoo, a declarative intermediate algebra that abstracts the detailed architectural properties of the hardware, such as multi-or many-core architectures, caches and SIMD registers, without losing the ability to generate highly tuned code. Because it consists of a collection of declarative, vector-oriented operations, Voodoo is easier to reason about and tune than low-level C and related hardware-focused extensions (Intrinsics, OpenCL, CUDA, etc.). This enables our Voodoo compiler to produce (OpenCL) code that rivals and even outperforms the fastest state-of-the-art in memory databases for both GPUs and CPUs. In addition, Voodoo makes it possible to express techniques as diverse as cacheconscious processing, predication and vectorization (again on both GPUs and CPUs) with just a few lines of code. Central to our approach is a novel idea we termed control vectors, which allows a code generating frontend to expose parallelism to the Voodoo compiler in a abstract manner, enabling portable performance across hardware platforms. We used Voodoo to build an alternative backend for Mon-etDB, a popular open-source in-memory database. Our backend allows MonetDB to perform at the same level as highly tuned in-memory databases, including HyPeR and Ocelot. We also demonstrate Voodoo's usefulness when investigating hardware conscious tuning techniques, assessing their performance on different queries, devices and data.
Modern analytics applications use a diverse mix of libraries and functions. Unfortunately, there is no optimization across these libraries, resulting in performance penalties as high as an order of magnitude in many applications. To address this problem, we proposed Weld, a common runtime for existing data analytics libraries that performs key physical optimizations such as pipelining under existing, imperative library APIs. In this work, we further develop the Weld vision by designing an automatic adaptive optimizer for Weld applications, and evaluating its impact on realistic data science workloads. Our optimizer eliminates multiple forms of overhead that arise when composing imperative libraries like Pandas and NumPy, and uses lightweight measurements to make data-dependent decisions at runtime in ad-hoc workloads where no statistics are available, with sub-second overhead. We also evaluate which optimizations have the largest impact in practice and whether Weld can be integrated into libraries incrementally. Our results are promising: using our optimizer, Weld accelerates data science workloads by up to 23× on one thread and 80× on eight threads, and its adaptive optimizations provide up to a 3.75× speedup over rule-based optimization. Moreover, Weld provides benefits if even just 4-5 operators in a library are ported to use it. Our results show that common runtime designs like Weld may be a viable approach to accelerate analytics.
A common operation in many data analytics workloads is to find the top-k items, i.e., the largest or smallest operations according to some sort order (implemented via LIMIT or ORDER BY expressions in SQL). A naive implementation of top-k is to sort all of the items and then return the first k, but this does much more work than needed. Although efficient implementations for top-k have been explored on traditional multi-core processors, there has been no prior systematic study of top-k implementations on GPUs, despite open requests for such implementations in GPU-based frameworks like TensorFlow 1 and ArrayFire 2 . In this work, we present several top-k algorithms for GPUs, including a new algorithm based on bitonic sort called bitonic top-k. The bitonic top-k algorithm is up to a factor of 15x faster than sort and 4x faster than a variety of other possible implementations for values of k up to 256. We also develop a cost model to predict the performance of several of our algorithms, and show that it accurately predicts actual performance on modern GPUs.
Abstract-The variety of memory devices in modern computer systems holds opportunities as well as challenges for data management systems. In particular, the exploitation of Graphics Processing Units (GPUs) and their fast memory has been studied quite intensively. However, current approaches treat GPUs as systems in their own right and fail to provide a generic strategy for efficient CPU/GPU cooperation. We propose such a strategy for relational query processing: calculating an approximate result based on lossily compressed, GPU-resident data and refine the result using residuals, i.e., the lost data, on the CPU. We developed the required algorithms, implemented the strategy in an existing DBMS and found up to 8 times performance improvement, even for datasets larger than the available GPU memory.
Abstract-Memory-Resident Database Management Systems (MRDBMS) have to be optimized for two resources: CPU cycles and memory bandwidth. To optimize for bandwidth in mixed OLTP/OLAP scenarios, the hybrid or Partially Decomposed Storage Model (PDSM) has been proposed. However, in current implementations, bandwidth savings achieved by partial decomposition come at increased CPU costs. To achieve the aspired bandwidth savings without sacrificing CPU efficiency, we combine partially decomposed storage with Just-in-Time (JiT) compilation of queries, thus eliminating CPU inefficient function calls. Since existing cost based optimization components are not designed for JiT-compiled query execution, we also develop a novel approach to cost modeling and subsequent storage layout optimization. Our evaluation shows that the JiT-based processor maintains the bandwidth savings of previously presented hybrid query processors but outperforms them by two orders of magnitude due to increased CPU efficiency.
The diversity of hardware components within a single system calls for strategies for efficient cross-device data processing. For example, existing approaches to CPU/GPU co-processing distribute individual relational operators to the "most appropriate" device. While pleasantly simple, this strategy has a number of problems: it may leave the "inappropriate" devices idle while overloading the "appropriate" device and putting a high pressure on the PCI bus. To address these issues we distribute data among the devices by partially decomposing relations at the granularity of individual bits. Each of the resulting bit-partitions is stored and processed on one of the available devices. Using this strategy, we implemented a processor for spatial range queries that makes efficient use of all available devices. The performance gains achieved indicate that bitwise distribution makes a good cross-device processing strategy.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.