On the surprising difficulty of simple things

Schuhknecht, Felix; Khanchandani, Pankaj; Dittrich, Jens

doi:10.14778/2777598.2777602

Cited by 22 publications

(19 citation statements)

References 6 publications

(16 reference statements)


“…Using a real hash function would make all our algorithms slower by the same constant and thus result in even lower relative overhead of reproducibility. Our implementation of PARTITIONANDAGGREGATE is up to 4 times faster than that of Cieslewicz and Ross [11] because we use the highly-tuned partitioning routine used in other work [9,31,33]. Back-of-the-envelope calculations suggest that we achieve the same performance as the implementations used in [26], as well, thereby ensuring our baseline for GROUPBY matches the state of the art.…”

Section: A Experimental Setup (mentioning)

confidence: 97%


Reproducible Floating-Point Aggregation in RDBMSs

Müller, Arteaga, Hoefler, et al. 2018

2018 IEEE 34th International Conference on Data Engineering (ICDE)


Industry-grade database systems are expected to produce the same result if the same query is repeatedly run on the same input. However, the numerous sources of non-determinism in modern systems make reproducible results difficult to achieve. This is particularly true if floating-point numbers are involved, where the order of the operations affects the final result. As part of a larger effort to extend database engines with data representations more suitable for machine learning and scientific applications, in this paper we explore the problem of making relational GROUPBY over floating-point formats bit-reproducible, i.e., ensuring any execution of the operator produces the same result up to every single bit. To that aim, we first propose a numeric data type that can be used as drop-in replacement for other number formats and is, unlike standard floating-point formats, associative. We use this data type to make state-of-the-art GROUPBY operators reproducible, but this approach incurs a slowdown between 4× and 12× compared to the same operator using conventional database number formats. We thus explore how to modify existing GROUPBY algorithms to make them bit-reproducible and efficient. By using vectorized summation on batches and carefully balancing batch size, cache footprint, and preprocessing costs, we are able to reduce the slowdown due to reproducibility to a factor between 1.9× and 2.4× of aggregation in isolation and to a mere 2.7% of end-to-end query performance even on aggregation-intensive queries in MonetDB. We thereby provide a solid basis for supporting more reproducible operations directly in relational engines. This document is an extended version of an article currently in print for the proceedings of ICDE'18 with the same title and by the same authors. The main additions are more implementation details and experiments.
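The abstract's point that the order of floating-point operations affects the final result can be illustrated with a small sketch. The input values and the helper name `naive_sum` are chosen here for illustration only, and `math.fsum` from the Python standard library stands in as an order-independent summation; it is not the numeric data type the paper proposes.

```python
import math

# Illustrative input (an assumption, not data from the paper): two large
# values that cancel each other, plus two small ones.
values = [1e16, 1.0, -1e16, 1.0]


def naive_sum(xs):
    """Left-to-right summation with ordinary double-precision addition."""
    total = 0.0
    for x in xs:
        total += x
    return total


# The same multiset of numbers, summed in two different orders.
forward = naive_sum(values)
reordered = naive_sum(sorted(values))

# math.fsum returns the correctly rounded sum, so it does not depend on
# the order of its input.
exact_forward = math.fsum(values)
exact_reordered = math.fsum(sorted(values))

print(forward, reordered)              # the two naive sums differ
print(exact_forward, exact_reordered)  # the two fsum results agree
```

A parallel GROUPBY may combine partial sums in a different order on every run, which is why naive summation alone cannot guarantee bit-identical results.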


“…In this case, i.e., if F = 1, PARALLELPARTITION is a no-op that forwards its input. Since modern hardware can run PARTITIONING efficiently only up to a certain fanout [9,26,33], we implement it recursively using zero or more levels of partitioning, i.e., we partition with F = f^d for f = 256 and d = 0, 1, …”

Section: B High-level Algorithm Structure (mentioning)

confidence: 99%


“…The field of adaptive data stores is a hot research topic with a series of novel approaches, such as the popular database cracking [22,24,28,29], its variations and analysis [30,58-60], advanced partitioning [32,44,61] or adaptive resp. holistic indexing [5,47,57].…”

Section: Related Work (mentioning)

confidence: 99%

GridTables: A One-Size-Fits-Most H2TAP Data Store

et al. 2020


Heterogeneous Hybrid Transactional Analytical Processing (H2TAP) database systems have been developed to match the requirements for low latency analysis of real-time operational data. Due to technical challenges, these systems are hard to architect, non-trivial to engineer, and complex to administrate. Current research has proposed excellent solutions to many of those challenges in isolation; a unified engine that optimizes performance by combining these solutions is still missing. In this concept paper, we suggest a highly flexible and adaptive data structure (called GRIDTABLE) to physically organize sparse but structured records in the context of H2TAP. For this, we focus on the design of an efficient highly-flexible storage layout that is built from scratch for mixed query workloads. The key challenges we address are: (1) partial storage in different memory locations, and (2) the ability to optimize for mixed OLTP-/OLAP access patterns. To guarantee safe and well-specified data definition or manipulation, as well as fast querying with no compromises on performance, we propose two dedicated access paths to the storage. In this paper, we explore the architecture and internals of GRIDTABLES showing design goals, concepts and trade-offs. We close this paper with open research questions and challenges that must be addressed in order to take advantage of the flexibility of our solution.


“…It is a fundamental technique for indexing, join processing, and sorting. We investigate two state-of-the-art out-of-place partitioning algorithms [18], which either perform a histogram generation pass beforehand or maintain a linked list of chunks inside the partitions to handle the key distribution. We also test a version enlarging the partitions adaptively using mremap.…”

Section: Structural Flexibility (mentioning)

confidence: 99%

RUMA has it

2016

Self Cite


Memory management is one of the most boring topics in database research. It plays a minor role in tasks like free-space management or efficient space usage. Here and there we also realize its impact on database performance when worrying about NUMA-aware memory allocation, data compacting, snapshotting, and defragmentation. But, overall, let's face it: the entire topic sounds as exciting as 'garbage collection' or 'debugging a program for memory leaks'. What if there were a technique that would promote memory management from a third class helper thingie to a first class citizen in algorithm and systems design? What if that technique turned the role of memory management in a database system (and any other data processing system) upside-down? What if that technique could be identified as a key for redesigning various core algorithms with the effect of outperforming existing state-of-the-art methods considerably? Then we would write this paper. We introduce RUMA: Rewired User-space Memory Access. It allows for physiological data management, i.e. we allow developers to freely rewire the mappings from virtual to physical memory (in user space) while at the same time exploiting the virtual memory support offered by hardware and operating system. We show that fundamental database building blocks such as array operations, partitioning, sorting, and snapshotting benefit strongly from RUMA.
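The citation statement above mentions an out-of-place partitioning variant that performs a histogram generation pass beforehand. A minimal sketch of that general idea follows, assuming integer keys and a radix partitioning function on the low key bits; the function name `partition_with_histogram` and its parameters are hypothetical and do not come from either paper.

```python
def partition_with_histogram(keys, bits=1):
    """Out-of-place radix partitioning on the low `bits` bits of each key.

    Returns (out, starts, hist): the partitioned copy of the input, the
    start offset of every partition in `out`, and the partition sizes.
    """
    fanout = 1 << bits
    mask = fanout - 1

    # Pass 1: count how many keys fall into each partition.
    hist = [0] * fanout
    for k in keys:
        hist[k & mask] += 1

    # An exclusive prefix sum turns the counts into start offsets.
    starts = [0] * fanout
    running = 0
    for p in range(fanout):
        starts[p] = running
        running += hist[p]

    # Pass 2: scatter every key to the next free slot of its partition.
    cursor = list(starts)
    out = [None] * len(keys)
    for k in keys:
        p = k & mask
        out[cursor[p]] = k
        cursor[p] += 1
    return out, starts, hist


# Hypothetical example input.
out, starts, hist = partition_with_histogram([5, 2, 7, 4, 1, 6, 3, 0], bits=1)
print(out, starts, hist)
```

The histogram pass costs an extra scan of the input, but it lets the output be allocated once at its exact size; the chunk-list and mremap variants named in the statement avoid that scan by growing the partitions on demand instead.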
