Skipping-oriented partitioning for columnar layouts

Sun, Liwen; Franklin, Michael J.; Wang, Jiannan; Wu, Eugene

doi:10.14778/3025111.3025123

Cited by 41 publications

(18 citation statements)

References 33 publications


“…The results showed 3‐7x improvements in the query response time compared to the traditional range partitioning. In their latest work, Sun et al presented a novel hybrid data skipping framework that optimizes the overall query performance by automatically balancing skipping effectiveness and tuple‐reconstruction overhead. It allows both horizontal and vertical partitioning of the data, which maximizes the overall query performance.…”

Section: Background and Related Work (mentioning)

confidence: 99%

The impact of columnar file formats on SQL‐on‐hadoop engine performance: A study on ORC and Parquet

Ivanov, Pergolesi (2019). Concurrency and Computation


Columnar file formats provide an efficient way to store data to be queried by SQL-on-Hadoop engines. Related works consider the performance of processing engine and file format together, which makes it impossible to predict their individual impact. In this work, we propose an alternative approach: by executing each file format on the same processing engine, we compare the different file formats as well as their different parameter settings. We apply our strategy to two processing engines, Hive and SparkSQL, and evaluate the performance of two columnar file formats, ORC and Parquet. We use BigBench (TPCx-BB), a standardized application-level benchmark for Big Data scenarios. Our experiments confirm that the file format selection and its configuration significantly affect the overall performance. We show that ORC generally performs better on Hive, whereas Parquet achieves best performance with SparkSQL. Using ZLIB compression brings up to 60.2% improvement with ORC, while Parquet achieves up to 7% improvement with Snappy. Exceptions are the queries involving text processing, which do not benefit from using any compression.


“…Partitioning a relation is NP-hard [72]. Data partitioning covers both the problem of partitioning a relation across multiple servers and within a single server [63,79,80]. Partitioning across both rows and columns is introduced by several systems to account for different read access patterns (e.g., on fact tables and dimension tables) [4,11,26].…”

Section: Related Work (mentioning)

confidence: 99%

Optimal column layout for hybrid workloads

2019


Data-intensive analytical applications need to support both efficient reads and writes. However, what is usually a good data layout for an update-heavy workload, is not well-suited for a read-mostly one and vice versa. Modern analytical data systems rely on columnar layouts and employ delta stores to inject new data and updates. We show that for hybrid workloads we can achieve close to one order of magnitude better performance by tailoring the column layout design to the data and query workload. Our approach navigates the possible design space of the physical layout: it organizes each column's data by determining the number of partitions, their corresponding sizes and ranges, and the amount of buffer space and how it is allocated. We frame these design decisions as an optimization problem that, given workload knowledge and performance requirements, provides an optimal physical layout for the workload at hand. To evaluate this work, we build an in-memory storage engine, Casper, and we show that it outperforms state-of-the-art data layouts of analytical systems for hybrid workloads. Casper delivers up to 2.32× higher throughput for update-intensive workloads and up to 2.14× higher throughput for hybrid workloads. We further show how to make data layout decisions robust to workload variation by carefully selecting the input of the optimization.


“…In [17,18], different partitioning approaches are presented, which help in selective queries. In [17], data is divided into multiple horizontal partitions and in each partition, data is stored row-wise, rather than column-wise.…”

Section: Related Work (mentioning)

confidence: 99%

“…This eventually gives a feature-vector for every tuple, which is then used for filtering partitions. A similar vector is also used in [18], however this time it utilizes hybrid layouts with column grouping, instead of fixed row layouts. The latter helps for both selection and projection queries.…”

Section: Related Work (mentioning)

confidence: 99%
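The feature-vector idea mentioned in the statement above can be illustrated with a minimal Python sketch. It assumes that features are filter predicates taken from a workload, that each tuple gets one bit per feature, and that a partition's vector is the bitwise OR of its tuples' vectors. The function names, the sample predicates, and the data are hypothetical and are not taken from the cited papers.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]
Feature = Callable[[Row], bool]


def feature_vector(row: Row, features: List[Feature]) -> List[bool]:
    """One bit per feature: True if the row satisfies that predicate."""
    return [f(row) for f in features]


def partition_vector(rows: List[Row], features: List[Feature]) -> List[bool]:
    """OR together the feature vectors of all rows in a partition."""
    vec = [False] * len(features)
    for row in rows:
        for i, bit in enumerate(feature_vector(row, features)):
            vec[i] = vec[i] or bit
    return vec


def partitions_to_scan(vectors: List[List[bool]], feature_index: int) -> List[int]:
    """A query implied by feature `feature_index` can skip every partition
    whose bit for that feature is False (no row there can match)."""
    return [pid for pid, vec in enumerate(vectors) if vec[feature_index]]


# Hypothetical workload features and two small partitions.
features: List[Feature] = [
    lambda r: r["price"] > 100,
    lambda r: r["qty"] < 5,
]
partitions = [
    [{"price": 10, "qty": 1}, {"price": 20, "qty": 2}],
    [{"price": 150, "qty": 9}, {"price": 200, "qty": 8}],
]
vectors = [partition_vector(p, features) for p in partitions]
```

In this toy data a query filtering on `price > 100` only needs to read the second partition, since no row in the first partition sets that bit.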


ATUN-HL: Auto Tuning of Hybrid Layouts Using Workload and Data Characteristics

Munir, Abelló, Romero, et al. (2018). Advances in Databases and Information Systems


Ad-hoc analysis implies processing data in near real-time. Thus, raw data (i.e., neither normalized nor transformed) is typically dumped into a distributed engine, where it is generally stored into a hybrid layout. Hybrid layouts divide data into horizontal partitions and inside each partition, data are stored vertically. They keep statistics for each horizontal partition and also support encoding (i.e., dictionary) and compression to reduce the size of the data. Their built-in support for many ad-hoc operations (i.e., selection, projection, aggregation, etc.) makes hybrid layouts the best choice for most operations. Horizontal partition and dictionary sizes of hybrid layouts are configurable and can directly impact the performance of analytical queries. Hence, their default configuration cannot be expected to be optimal for all scenarios. In this paper, we present ATUN-HL (Auto TUNing Hybrid Layouts), which based on a cost model and given the workload and the characteristics of data, finds the best values for these parameters. We prototyped ATUN-HL for Apache Parquet, which is an open source implementation of hybrid layouts in Hadoop Distributed File System, to show its effectiveness. Our experimental evaluation shows that ATUN-HL provides on average 85% of all the potential performance improvement, and 1.2x average speedup against default configuration.
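The hybrid layout described in this abstract (horizontal partitions, column-wise storage inside each partition, and per-partition statistics) can be sketched as follows. This is an illustrative Python model only, assuming min/max statistics and an equality filter; `RowGroup`, `build_row_groups`, and `scan_equal` are hypothetical names and do not correspond to the Parquet API or to ATUN-HL.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """One horizontal partition, stored column by column."""
    columns: Dict[str, List[int]]
    stats: Dict[str, Tuple[int, int]]  # per-column (min, max)


def build_row_groups(rows: List[Dict[str, int]], group_size: int) -> List[RowGroup]:
    """Split rows horizontally, then pivot each chunk into columns."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append(RowGroup(columns, stats))
    return groups


def scan_equal(groups: List[RowGroup], filter_col: str, value: int,
               project_col: str) -> List[int]:
    """Return project_col for rows where filter_col == value,
    skipping row groups whose min/max range excludes the value."""
    out = []
    for g in groups:
        lo, hi = g.stats[filter_col]
        if value < lo or value > hi:
            continue  # the statistics prove no row in this group matches
        # Only the two needed columns are touched (projection).
        for f, p in zip(g.columns[filter_col], g.columns[project_col]):
            if f == value:
                out.append(p)
    return out


rows = [{"id": i, "v": i * 10} for i in range(6)]
groups = build_row_groups(rows, group_size=2)
```

The `group_size` parameter plays the role of the configurable horizontal partition size: smaller groups give tighter statistics and more skipping, at the cost of more metadata and more partitions to manage.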
