2016
DOI: 10.1007/978-3-319-49748-8_6

Performance Evaluation of Spark SQL Using BigBench

Cited by 8 publications (10 citation statements)
References 11 publications
“…Queries Q02 and Q30 achieve standard deviations varying between 7% and 16%, which will be explained in Section 5.4. All other queries have standard deviations around 10%, which indicates that SparkSQL is less stable than Hive, as reported in the work of Ivanov and Beer. We believe this is also due to execution noise in the cluster, which affects SparkSQL more because its execution times are generally much shorter than Hive's.…”
Section: Spark SQL
confidence: 76%
“…This work is a continuation of a series of benchmark experiments conducted at the Frankfurt Big Data Lab.…”
Section: Introduction
confidence: 99%
“…Apache Spark [25]: similar to Hive, Spark is another popular framework gaining momentum [12]. Spark is a processing engine that provides increased performance over the original MapReduce by leveraging in-memory computation.…”
Section: Background and Related Work
confidence: 99%
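The statement above attributes Spark's speedup over MapReduce to in-memory computation. As a minimal sketch of that idea (the input path and column names are hypothetical, not taken from the cited papers), the PySpark snippet below caches an intermediate DataFrame so later actions reuse it from executor memory instead of recomputing it from disk:

# Minimal PySpark sketch of in-memory reuse; path and columns are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-sketch").getOrCreate()

# Hypothetical columnar input; any source read into a DataFrame works the same way.
sales = spark.read.parquet("/data/web_sales")

# Cache the filtered result in memory; subsequent actions reuse the cached data
# rather than re-reading and re-filtering the source.
recent = sales.filter(sales.ws_sold_date_sk > 2451000).cache()

recent.count()                                # first action materializes the cache
recent.groupBy("ws_item_sk").count().show()   # served from the in-memory copy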
“…We believe that BigBench [5] is the current reference benchmark for such systems. Regarding BigBench query results, to date only a handful of official submissions are available [20], along with a few publications with detailed per-query characterization [2,12]. More established benchmarks, e.g., TPC-H, have been analyzed much more thoroughly, including work on their query choke points, as in "TPC-H Analyzed" [1].…”
Section: Related Work
confidence: 99%
“…[51]:
• yarn.nodemanager.resource.memory-mb
• yarn.nodemanager.resource.cpu-vcores
• yarn.scheduler.maximum-allocation-mb
• yarn.scheduler.minimum-allocation-mb
• yarn.scheduler.maximum-allocation-vcores
• yarn.scheduler.minimum-allocation-vcores…”
confidence: 99%
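The YARN properties quoted above bound how much memory and how many vcores a NodeManager offers and how large a single container request may be. As a hedged sketch (the limit values and executor settings below are illustrative assumptions, not figures from the paper), a Spark SQL session on YARN would size its executors to fit inside those bounds:

# Illustrative PySpark sketch: executor requests must fit the YARN limits above,
# e.g. spark.executor.memory plus the memory overhead must not exceed
# yarn.scheduler.maximum-allocation-mb, and spark.executor.cores must not exceed
# yarn.scheduler.maximum-allocation-vcores. The numbers below are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("yarn-sizing-sketch")
    .master("yarn")
    .config("spark.executor.memory", "12g")         # fits a hypothetical 16384 MB max allocation
    .config("spark.executor.memoryOverhead", "2g")  # counted against the same container limit
    .config("spark.executor.cores", "4")            # within a hypothetical 8-vcore max allocation
    .getOrCreate()
)

# Trivial Spark SQL query to confirm the session started on the YARN cluster.
spark.sql("SELECT 1 AS ok").show()

Requests that exceed the configured maximum allocation are rejected by YARN, so these scheduler properties directly cap executor sizing and, with it, the resources available to Spark SQL queries.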