2014
DOI: 10.1007/978-3-319-13021-7_12

A Study of SQL-on-Hadoop Systems

Cited by 27 publications (9 citation statements)
References 10 publications
“…Similarly, Chen et al compare multiple SQL‐on‐Hadoop engines using modified TPC‐DS queries on clusters with varying number of nodes. In terms of storage formats, they use the default ORC and Parquet configuration parameters.…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…For example, ORC is favored by Hive and Presto, whereas Parquet is first choice for SparkSQL and Impala. A number of studies have investigated and compared the performance of file formats running them on different SQL‐on‐Hadoop engines. However, because of the different internal engine architectures, these works actually compare the engine together with its file format optimizations.…”
Section: Introduction (mentioning)
confidence: 99%
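To make the engine-plus-format coupling described in the statement above concrete, the following is a minimal PySpark sketch that writes the same synthetic table in both ORC and Parquet and runs the same filter-and-count on each copy. The paths, schema, and toy timing are hypothetical illustrations, not taken from the cited studies, and any measured difference reflects Spark's own format optimizations rather than a general verdict on ORC versus Parquet.

```python
# Minimal sketch; assumes a local PySpark installation. Paths and the
# synthetic table below are hypothetical, not from the cited benchmarks.
import time

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orc-vs-parquet-sketch").getOrCreate()

# A small synthetic table standing in for a benchmark fact table.
df = spark.range(0, 1_000_000).withColumn("val", F.col("id") % 97)

# Write the same data once per storage format.
df.write.mode("overwrite").orc("/tmp/bench_orc")
df.write.mode("overwrite").parquet("/tmp/bench_parquet")

def timed_scan(fmt, path):
    # Run the same aggregate over each copy; wall-clock time only hints at
    # the engine/format interplay discussed in the citing papers.
    start = time.time()
    rows = spark.read.format(fmt).load(path).where("val < 10").count()
    return rows, time.time() - start

for fmt, path in [("orc", "/tmp/bench_orc"), ("parquet", "/tmp/bench_parquet")]:
    rows, secs = timed_scan(fmt, path)
    print(f"{fmt}: {rows} matching rows scanned in {secs:.2f}s")

spark.stop()
```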
“…As seen previously in the literature, one important feature of data is the type of file formats, stating the way in which data is stored (Li & Zhou, ). For this benchmark, the recommendations of the Stinger initiative (Chen et al, ) were followed, storing data in the ORC format and using Tez as the execution engine when evaluating Hive. ORC stands for optimized row columnar, optimizing data storage when compared with other file formats.…”
Section: Experimental Evaluation (mentioning)
confidence: 99%
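As a hedged illustration of the Stinger-style setup mentioned above (ORC storage with Tez as Hive's execution engine), the following Python sketch issues the corresponding HiveQL through PyHive. It assumes PyHive is installed and a HiveServer2 instance is reachable at the hypothetical host and port shown; the table name and columns are made up for illustration only.

```python
from pyhive import hive  # assumption: PyHive installed, HiveServer2 running

# Hypothetical connection details; substitute a real HiveServer2 endpoint.
conn = hive.connect(host="hive.example.com", port=10000)
cur = conn.cursor()

# Stinger-style configuration: Tez as the execution engine for this session.
cur.execute("SET hive.execution.engine=tez")

# Store the table as ORC (optimized row columnar), as in the benchmark above.
cur.execute("""
    CREATE TABLE IF NOT EXISTS store_sales_orc (
        item_id  BIGINT,
        quantity INT,
        net_paid DOUBLE
    )
    STORED AS ORC
""")

# Queries against the ORC-backed table now run on Tez and benefit from
# ORC's columnar reads and predicate pushdown.
cur.execute("SELECT COUNT(*) FROM store_sales_orc WHERE quantity > 10")
print(cur.fetchone())

cur.close()
conn.close()
```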
“…Since then, many database benchmarks have been proposed by academia and industry for various evaluation goals, such as TPC-C [25] for RDBMSs, TPC-DI [21] for data integration; OO7 benchmark [2] for object-oriented DBMSs, and XML benchmark systems [15,23] for XML DBMSs. More recently, the NoSQL and big data movement in the late 2000s brought the arrival of the next generation of benchmarks, such as YCSB benchmark [4] for cloud serving systems, LDBC [6] for Graph and RDF DBMSs, BigBench [3,10] for big data systems. However, those general-purpose or micro benchmarks are not designed for MMDBs.…”
Section: Introduction (mentioning)
confidence: 99%