2015
DOI: 10.1007/s11390-015-1522-5
Accelerating Iterative Big Data Computing Through MPI

Abstract: Current popular systems, Hadoop and Spark, cannot achieve satisfactory performance because of inefficient overlapping of computation and communication when running iterative big data applications. The pipeline of computing, data movement, and data management plays a key role in current distributed data computing systems. In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running a PageRank workload, and then propose an event-driven pipeline and in-memory shuffle design w…
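The overlap the abstract describes can be illustrated in miniature. The following is a hypothetical sketch, not the paper's DataMPI-based implementation: it uses Python threads and a queue in place of MPI, purely to show how computation on early partitions can begin while later partitions are still being moved, instead of waiting at a full shuffle barrier.

```python
import threading
import queue

def pipelined_sum_of_squares(partitions):
    """Overlap 'communication' (moving partitions through a queue)
    with 'computation' (squaring), instead of a full barrier."""
    q = queue.Queue()
    results = []

    def consumer():
        # Computation starts as soon as the first partition arrives,
        # not after all partitions have been shuffled.
        while True:
            part = q.get()
            if part is None:
                break
            results.append(sum(x * x for x in part))

    t = threading.Thread(target=consumer)
    t.start()
    for part in partitions:   # producer: emits partitions incrementally
        q.put(part)
    q.put(None)               # sentinel: no more data
    t.join()
    return sum(results)
```

In an MPI setting the queue would correspond to non-blocking sends/receives, with the event-driven pipeline dispatching compute work as each message completes.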

Cited by 15 publications (9 citation statements) | References 19 publications
“…Other works attempted to develop novel approaches to data-centric programming. For example, in [88] the authors proposed an event-driven pipeline and in-memory shuffle using DataMPI-Iteration, which overlapped computation and communication for iterative BDA computing and showed speedups of 9×–21× over Apache Hadoop and 2×–3× over Apache Spark on PageRank and K-means. Another approach for running data-centric applications on MPI beyond the MapReduce model was proposed in [89], where the authors presented a set of building blocks that provide scalable data-movement capability to computational scientists and visualization researchers for writing their own parallel analyses.…”
Section: Process-centric Computing Models: MPI and OpenMP
confidence: 99%
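The PageRank workload cited above is shuffle-bound: every iteration exchanges rank contributions across the graph before the next iteration can proceed. A toy single-process power-iteration sketch (illustrative only, not DataMPI-Iteration's distributed implementation) makes the per-iteration exchange visible:

```python
def pagerank(links, iters=20, d=0.85):
    """Toy power-iteration PageRank over an adjacency dict.
    The 'contrib' exchange below is the per-iteration step that
    becomes the shuffle phase in a distributed setting."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # 'shuffle': contributions sent from each page to its out-links
        contrib = {n: 0.0 for n in nodes}
        for n, outs in links.items():
            for m in outs:
                contrib[m] += rank[n] / len(outs)
        rank = {n: (1 - d) / len(nodes) + d * contrib[n] for n in nodes}
    return rank
```

Because this exchange happens once per iteration, keeping it in memory and overlapping it with computation (rather than materializing it to disk, as Hadoop does) compounds into the 9×–21× speedups reported.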
“…Michael et al. [37] presented Spark SQL as SQL-like data processing built on top of Spark Core. Spark SQL processes data on the RDD abstraction, which has many limitations stemming from its in-memory nature [12,13]. Spark SQL's reliance on main memory makes the infrastructure less cost-effective, because memory resources are much more expensive than disk, whereas Hadoop processes data on disk [38].…”
Section: Related Work
confidence: 99%
“…An RDD's data can be recomputed if it is lost in a failure, because the model avoids data replication [12]. HLQLs based on RDDs perform many iterative computations on the same data, which necessitates a lot of memory to keep the data resident [13]. Therefore, the use of main memory by RDD-based HLQLs raises the cost of the infrastructure, because memory resources are much more expensive than disk.…”
confidence: 99%
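The recompute-instead-of-replicate idea mentioned in this statement can be sketched in a few lines. The class below is a hypothetical minimal model of lineage-based recovery, not Spark's actual RDD API: a derived dataset records only its parent and transformation, so a lost partition is replayed from lineage rather than restored from a replica.

```python
class RDD:
    """Minimal sketch of lineage-based recovery: a partition is not
    replicated; if its cached value is lost, it is recomputed from
    its parent via the recorded transformation (its lineage)."""
    def __init__(self, data=None, parent=None, fn=None):
        self.parent, self.fn = parent, fn
        self._cache = data          # source RDDs hold their data directly

    def map(self, fn):
        # Record lineage only; no computation, no replication.
        return RDD(parent=self, fn=fn)

    def collect(self):
        if self._cache is None:     # lost (or never computed): replay lineage
            self._cache = [self.fn(x) for x in self.parent.collect()]
        return self._cache

base = RDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
doubled.collect()        # computes and caches the derived partition
doubled._cache = None    # simulate losing the partition in a failure
```

Calling `doubled.collect()` again after the simulated loss rebuilds the data from `base`, which is the trade the quoted text describes: cheap fault tolerance paid for with memory pressure and recomputation time.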
“…In (Lu et al., 2014; Lu and Liang, 2016) … In (Reyes-Ortiz et al., 2015) the authors compare Apache Spark performance against MPI/OpenMP on the KNN and Pegasos SVM machine-learning algorithms. The results showed that the MPI/OpenMP approach is still more than 10 times faster in running time; however, one should note that Spark has the advantage of caching, which the authors did not address in their paper.…”
Section: Related Work
confidence: 99%