2018
DOI: 10.1002/widm.1297
Big data processing tools: An experimental performance evaluation

Abstract: Big Data is currently a hot topic of research and development across several business areas, mainly due to recent innovations in information and communication technologies. One of the main challenges of Big Data relates to how one should efficiently handle massive volumes of complex data. Due to the notorious complexity of the data that can be collected from multiple sources, usually motivated by increasing data volumes gathered at high velocity, efficient processing mechanisms are needed for data analysis purposes…



Cited by 18 publications (23 citation statements)
References 20 publications
“…The work of [9] benchmarks different SQL-on-Hadoop systems (Hive, Spark, Presto and Drill) using the Star Schema Benchmark (SSB), also used in [10], which tests Hive and Presto with different partitioning and bucketing strategies. In [6], Drill, HAWQ, Hive, Impala, Presto and Spark were benchmarked, showing the advantages of in-memory processing tools like HAWQ, Impala and Presto. Despite the good performance of these in-memory processing tools, that work also shows the increase in processing time observed when these tools do not have enough RAM and activate the "Spill to Disk" functionality, making use of secondary memory.…”
Section: Related Work
confidence: 99%
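The bucketing strategy mentioned above works by hashing a bucket key into a fixed number of files, so an equality filter on that key only needs to scan one bucket. A minimal Python sketch of the idea (the bucket count, column names, and hash function are illustrative, not taken from the benchmarked systems):

```python
# Sketch of Hive-style bucketing: rows are routed to a fixed number of
# buckets by hashing the bucket key; a point query then scans one bucket.
NUM_BUCKETS = 4

def bucket_of(key: str) -> int:
    # Deterministic toy hash so the sketch is reproducible (Python's
    # built-in hash() is salted per process for strings).
    return sum(key.encode()) % NUM_BUCKETS

def write_bucketed(rows):
    # "Write side": distribute rows into buckets by the bucket key.
    buckets = {i: [] for i in range(NUM_BUCKETS)}
    for row in rows:
        buckets[bucket_of(row["customer"])].append(row)
    return buckets

def point_query(buckets, customer):
    # "Read side": only the one bucket that can contain the key is scanned.
    candidate = buckets[bucket_of(customer)]
    return [r for r in candidate if r["customer"] == customer]

rows = [{"customer": c, "amount": a}
        for c, a in [("alice", 10), ("bob", 20), ("carol", 30), ("alice", 5)]]
buckets = write_bucketed(rows)
print(point_query(buckets, "alice"))  # both "alice" rows, from one bucket
```

Real systems use this same pruning to cut I/O; the excerpt's "Spill to Disk" cost appears on the opposite side, when intermediate state no longer fits in RAM.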
“…The Big Data concept also impacts the traditional Data Warehouse (DW), leading to the Big Data Warehouse (BDW), which shares the same goals in terms of data integration and decision-making support but addresses Big Data characteristics [4], [5] such as massively parallel processing; mixed and complex analytical workloads (e.g., ad hoc querying, data mining, text mining, exploratory analysis and materialized views); and flexible storage to support data from several sources or real-time operations (stream processing, low latency and high-frequency updates), to mention only a few. SQL-on-Hadoop systems are also gaining notoriety, aiming for interactive, low-latency query execution that provides timely analytics to support the decision-making process, in which every second counts [6]. Aligned with the research trends of supporting OLAP (Online Analytical Processing) workloads and aggregations over Big Data [7], this paper compares Apache Druid, which promises fast aggregations in Big Data environments [8], with two well-known SQL-on-Hadoop systems, Hive and Presto.…”
Section: Introduction
confidence: 99%
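The fast OLAP aggregations Druid promises rest largely on ingestion-time rollup: events sharing the same timestamp bucket and dimension values are pre-aggregated, so queries touch far fewer rows. A minimal Python sketch of that idea (the schema, dimension names, and metric are illustrative assumptions, not Druid's actual API):

```python
from collections import defaultdict

# Sketch of Druid-style ingestion rollup: events with identical
# (time bucket, dimensions) are collapsed into one pre-aggregated row.
def rollup(events, dims, metric):
    agg = defaultdict(int)
    for e in events:
        key = tuple(e[d] for d in dims)
        agg[key] += e[metric]
    return dict(agg)

events = [
    {"hour": "10:00", "country": "PT", "clicks": 3},
    {"hour": "10:00", "country": "PT", "clicks": 2},
    {"hour": "10:00", "country": "ES", "clicks": 1},
    {"hour": "11:00", "country": "PT", "clicks": 4},
]
segment = rollup(events, dims=("hour", "country"), metric="clicks")
print(segment)  # 4 raw events collapsed into 3 pre-aggregated rows
```

A later group-by over `hour` and `country` scans the 3 rolled-up rows instead of the 4 raw events; at scale, this reduction is what makes sub-second aggregations feasible.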
“…The goal of this fourth phase is to select the most suitable technology for non-raw data repositories that provide analytical capabilities (REP). Below, we specify the requirements from Table 1 that carry the most weight [5], [30], [31] in the selection of the technology for the analytical repository, also called the Big Data Warehouse [16]:…”
Section: F. Phase 4: Analytical Repositories Implementation
confidence: 99%
“…Considering that Hive has native support for ORC files, and that the ORC file format offers good encodings, compression algorithms, and multi-dimensional statistical information [29], the data in this paper is stored in the ORC file format.…”
Section: Hive File Format
confidence: 99%
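The "multi-dimensional statistical information" the excerpt credits to ORC refers to per-stripe column statistics (min/max, among others), which let a reader skip whole stripes that cannot match a filter. A minimal Python sketch of that pruning logic, with an illustrative stripe size and toy integer data rather than real ORC structures:

```python
# Sketch of ORC-style stripe statistics: each stripe records min/max per
# column, so a filter can skip stripes whose range cannot match.
def build_stripes(values, stripe_size):
    stripes = []
    for i in range(0, len(values), stripe_size):
        chunk = values[i:i + stripe_size]
        stripes.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return stripes

def filtered_scan(stripes, predicate_value):
    # Predicate pushdown: only stripes whose [min, max] interval contains
    # the searched value are actually read.
    hits, scanned = [], 0
    for s in stripes:
        if s["min"] <= predicate_value <= s["max"]:
            scanned += 1
            hits.extend(v for v in s["rows"] if v == predicate_value)
    return hits, scanned

values = list(range(100))           # sorted data maximizes stripe skipping
stripes = build_stripes(values, 25)
hits, scanned = filtered_scan(stripes, 42)
print(hits, scanned)  # [42] found after scanning only 1 of 4 stripes
```

This is why storing data sorted (or partitioned) on frequently filtered columns compounds with the ORC format: tighter min/max ranges per stripe mean more stripes can be skipped.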