2019
DOI: 10.1002/cpe.5523
The impact of columnar file formats on SQL‐on‐hadoop engine performance: A study on ORC and Parquet

Abstract: Columnar file formats provide an efficient way to store data to be queried by SQL-on-Hadoop engines. Related works consider the performance of processing engine and file format together, which makes it impossible to predict their individual impact. In this work, we propose an alternative approach: by executing each file format on the same processing engine, we compare the different file formats as well as their different parameter settings. We apply our strategy to two processing engines, Hive and SparkSQL, an…
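The comparison strategy described in the abstract — holding the processing engine fixed while varying the file format and its parameters — can be sketched as a benchmark harness over the full factor grid. This is a minimal illustration only; the settings values and the `run_workload` stub are assumptions, not the paper's actual code:

```python
import itertools

def run_workload(engine, file_format, settings):
    """Placeholder for executing a fixed SQL workload on one
    engine/format/settings combination and timing it. In the real
    study this would submit queries to Hive or SparkSQL."""
    return 0.0  # elapsed seconds (stubbed)

# Factors mirroring the study's design (parameter values are assumed):
engines = ["Hive", "SparkSQL"]
formats = ["ORC", "Parquet"]
settings = [{"compression": "zlib"}, {"compression": "snappy"}]

results = {
    (engine, fmt, cfg["compression"]): run_workload(engine, fmt, cfg)
    for engine, fmt, cfg in itertools.product(engines, formats, settings)
}

# Holding the engine fixed and varying only the format (or its
# settings) isolates the file format's individual impact.
print(len(results))  # 2 engines x 2 formats x 2 settings = 8 runs
```

Because every format runs under both engines, per-format differences can be read off within each engine, rather than being confounded with engine choice.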

Cited by 20 publications (11 citation statements). References 36 publications.
“…Also, minor changes required for the serialization of genomic interval parameters have been added to the disq (https://github.com/mwiewior/disq) library. For saving output, we support not only the ORC but also the Parquet file format (Ivanov and Pergolesi, 2020). In our code, we re-used the partition coalescing mechanism as implemented in the GATK.…”
Section: Technical Design (mentioning)
confidence: 99%
“…After the data were retrieved from the sources, they were stored in the HDFS using the Parquet format, which is one of several formats that can be used to store data in the HDFS. Other formats that can be used are, for example, ORC or Avro [35]. Parquet was chosen not only due to its adequate compatibility with Spark and Impala technology, but also due to its read-oriented format and adequate compression, which brings advantages when querying the data [36].…”
Section: Technological Architecture (mentioning)
confidence: 99%
“…The research results are recommendations on the use of each format for specific tasks. [30] is a comprehensive study of the Apache Parquet and ORC formats. Both formats are column-oriented and share similar characteristics and properties.…”
Section: E. Analysis of Data Storage Formats (mentioning)
confidence: 99%
“…The study [31] developed a methodology for analyzing data storage formats based on comparative analysis, experimental evaluation, and a mathematical model for choosing an alternative. For the experimental evaluation, the Apache Spark framework [24] was used, which is one of the most popular tools for analyzing data in the Apache Hadoop system.…”
Section: E. Analysis of Data Storage Formats (mentioning)
confidence: 99%