The Efficiency of MapReduce in Parallel External Memory

Greiner, Gero; Jacob, Riko

doi:10.1007/978-3-642-29344-3_37

Cited by 6 publications

(3 citation statements)

References 14 publications

“…Here we generalize this idea to the processing of any conjunctive query in a rigorous way. We should also note that previous work [9] has studied the simulation of MapReduce algorithms on a parallel external memory model.…”

Section: Simulating an MPC Algorithm (mentioning)

confidence: 99%

Worst-Case Optimal Algorithms for Parallel Query Processing

Beame, Koutris, Suciu

2016

Preprint

In this paper, we study the communication complexity for the problem of computing a conjunctive query on a large database in a parallel setting with p servers. In contrast to previous work, where upper and lower bounds on the communication were specified for particular structures of data (either data without skew, or data with specific types of skew), in this work we focus on worst-case analysis of the communication cost. The goal is to find worst-case optimal parallel algorithms, similar to the work of [18] for sequential algorithms. We first show that for a single round we can obtain an optimal worst-case algorithm. The optimal load for a conjunctive query q when all relations have size equal to M is $O(M/p^{1/\psi^*})$, where $\psi^*$ is a new query-related quantity called the edge quasi-packing number, which is different from both the edge packing number and edge cover number of the query hypergraph. For multiple rounds, we present algorithms that are optimal for several classes of queries. Finally, we show a surprising connection to the external memory model, which allows us to translate parallel algorithms to external memory algorithms. This technique allows us to recover (within a polylogarithmic factor) several recent results on the I/O complexity for computing join queries, and also obtain optimal algorithms for other classes of queries.

“…Theoretical consideration was given in [23], where the authors present upper and lower bounds on the parallel I/O complexity of the shuffle phase, bounding the worst-case performance loss of the MapReduce approach in terms of I/O-efficiency. Shared environment optimizations for Hadoop MapReduce based on pre-fetching and pre-shuffling were explored in [24].…”

Section: Related Work (mentioning)

confidence: 99%

Leveraging Adaptive I/O to Optimize Collective Data Shuffling Patterns for Big Data Analytics

Nicolae, Costa, Misale et al. 2017

IEEE Trans. Parallel Distrib. Syst.

Big data analytics is an indispensable tool in transforming science, engineering, medicine, health-care, finance and ultimately business itself. With the explosion of data sizes and need for shorter time-to-solution, in-memory platforms such as Apache Spark gain increasing popularity. In this context, data shuffling, a particularly difficult transformation pattern, introduces important challenges. Specifically, data shuffling is a key component of complex computations that has a major impact on the overall performance and scalability. Thus, speeding up data shuffling is a critical goal. To this end, state-of-the-art solutions often rely on overlapping the data transfers with the shuffling phase. However, they employ simple mechanisms to decide how much data and where to fetch it from, which leads to sub-optimal performance and excessive auxiliary memory utilization for the purpose of prefetching. The latter aspect is a growing concern, given evidence that memory per computation unit is continuously decreasing while interconnect bandwidth is increasing. This paper contributes a novel shuffle data transfer strategy that addresses the two aforementioned dimensions by dynamically adapting the prefetching to the computation. We implemented this novel strategy in Spark, a popular in-memory data analytics framework. To demonstrate the benefits of our proposal, we run extensive experiments on an HPC cluster with large core count per node. Compared with the default Spark shuffle strategy, our proposal shows: up to 40% better performance with 50% less memory utilization for buffering and excellent weak scalability.

“…With respect to data shuffling itself, the problem has been explored from multiple perspectives. Theoretical consideration was given in [11], where the authors present upper and lower bounds on the parallel I/O complexity of the shuffle phase. Low-level optimizations of the networking layer where data shuffling is explored in the context of high performance interconnects such as InfiniBand exist both for MapReduce [12] and Spark [13].…”

Section: Related Work (mentioning)

confidence: 99%

Towards Memory-Optimized Data Shuffling Patterns for Big Data Analytics

Nicolae, Costa, Misale et al. 2016

2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)

Big data analytics is an indispensable tool in transforming science, engineering, medicine, healthcare, finance and ultimately business itself. With the explosion of data sizes and need for shorter time-to-solution, in-memory platforms such as Apache Spark gain increasing popularity. However, this introduces important challenges, among which data shuffling is particularly difficult: on one hand it is a key part of the computation that has a major impact on the overall performance and scalability so its efficiency is paramount, while on the other hand it needs to operate with scarce memory in order to leave as much memory available for data caching. In this context, efficient scheduling of data transfers such that it addresses both dimensions of the problem simultaneously is non-trivial. State-of-the-art solutions often rely on simple approaches that yield sub-optimal performance and resource usage. This paper contributes a novel shuffle data transfer strategy that dynamically adapts to the computation with minimal memory utilization, which we briefly underline as a series of design principles.
