2019
DOI: 10.1002/cpe.5523
The impact of columnar file formats on SQL‐on‐hadoop engine performance: A study on ORC and Parquet

Abstract: Columnar file formats provide an efficient way to store data to be queried by SQL-on-Hadoop engines. Related works consider the performance of processing engine and file format together, which makes it impossible to predict their individual impact. In this work, we propose an alternative approach: by executing each file format on the same processing engine, we compare the different file formats as well as their different parameter settings. We apply our strategy to two processing engines, Hive and SparkSQL, an…
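The comparison strategy described in the abstract — holding the processing engine fixed while varying the file format and its parameters — can be sketched as a benchmark harness over the full factor grid. This is a minimal illustration only; the settings values and the `run_workload` stub are assumptions, not the paper's actual code:

```python
import itertools

def run_workload(engine, file_format, settings):
    """Placeholder for executing a fixed SQL workload on one
    engine/format/settings combination and timing it. In the real
    study this would submit queries to Hive or SparkSQL."""
    return 0.0  # elapsed seconds (stubbed)

# Factors mirroring the study's design (parameter values are assumed):
engines = ["Hive", "SparkSQL"]
formats = ["ORC", "Parquet"]
settings = [{"compression": "zlib"}, {"compression": "snappy"}]

results = {
    (engine, fmt, cfg["compression"]): run_workload(engine, fmt, cfg)
    for engine, fmt, cfg in itertools.product(engines, formats, settings)
}

# Holding the engine fixed and varying only the format (or its
# settings) isolates the file format's individual impact.
print(len(results))  # 2 engines x 2 formats x 2 settings = 8 runs
```

Because every format runs under both engines, per-format differences can be read off within each engine, rather than being confounded with engine choice.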

Cited by 20 publications (11 citation statements). References 36 publications.
“…Also, minor changes required for the serialization of genomic interval parameters have been added to the disq (https://github.com/mwiewior/disq) library. For saving output, we support not only the ORC but also the Parquet file format (Ivanov and Pergolesi, 2020). In our code, we re-used the partition coalescing mechanism as implemented in the GATK.…”
Section: Technical Design (mentioning)
confidence: 99%
“…After the data were retrieved from the sources, they were stored in the HDFS using the Parquet format, which is one of several formats that can be used to store data in the HDFS. Other formats that can be used are, for example, ORC or Avro [35]. Parquet was chosen not only due to its adequate compatibility with Spark and Impala technology, but also due to its read-oriented format and adequate compression, which brings advantages when querying the data [36].…”
Section: Technological Architecture (mentioning)
confidence: 99%
“…The research results are recommendations on the use of each format for specific tasks. [30] is a comprehensive study of the Apache Parquet and ORC formats. Both formats are column-oriented and share similar characteristics and properties.…”
Section: E. Analysis of Data Storage Formats (mentioning)
confidence: 99%
“…The study [31] developed a methodology for analyzing data storage formats based on comparative analysis, experimental evaluation, and a mathematical model for choosing an alternative. For the experimental evaluation, the Apache Spark framework [24] was used, which is one of the most popular tools for analyzing data in the Apache Hadoop system.…”
Section: E. Analysis of Data Storage Formats (mentioning)
confidence: 99%