2017
DOI: 10.14778/3090163.3090168
Bridging the gap between HPC and big data frameworks

Abstract: Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower than native implementations written with high performance computing tools such as MPI. There is a need to bridge the performance gap while retaining the benefits of the Spark ecosystem such as availability, productivity, and fault tolerance. In this paper, we propo…

Cited by 43 publications
(29 citation statements)
References 22 publications
“…[28] implemented 3 matrix kernels on Spark, and comparisons with C+MPI implementations showed a performance gap of 10x–40x without I/O. [29] proposed a system for integrating MPI with Spark and achieved 3.1–17.7x speedups on four graph and machine learning applications.…”
Section: Discussion
confidence: 99%
“…Big data programming models can be enhanced by integrating them with parallel programming models such as MPI. This approach can be seen in [4], which showed how to enable the Spark environment to use MPI libraries. Although this technique yields remarkable speedups, it must use shared memory, and it introduces other overheads as a potential drawback.…”
Section: Related Work
confidence: 99%
“…Unfortunately, there is usually a performance issue when running big data applications on HPC clusters because such applications are written in high-level programming languages. Such languages may be lacking in performance and may not encourage or support writing highly parallel programs, in contrast to parallel programming models like the Message Passing Interface (MPI) [4]. Furthermore, these platforms are designed with a distributed architecture, which differs from the architecture of HPC clusters [5].…”
Section: Introduction
confidence: 99%
“…Several recent projects have attempted to interface Spark with MPI-based codes. One of these is Spark+MPI [1], which also invokes existing MPI-based libraries. The approach used by this project serializes the data and transfers it from Spark to an existing MPI-based library using shared memory.…”
Section: Related Work
confidence: 99%
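The passage above describes the Spark+MPI handoff at a high level: the Spark side serializes partition data into a shared-memory segment, and a native MPI-based library attaches to the same segment and deserializes it without copying the data back through the JVM. The following is a minimal sketch of that idea only, not the paper's actual implementation; the function names are hypothetical, plain `pickle` stands in for the paper's serialization format, and an in-process reader stands in for the external MPI library.

```python
# Hypothetical sketch of a shared-memory handoff in the style the
# citation describes. All names here are illustrative, not Spark+MPI's API.
import pickle
from multiprocessing import shared_memory


def spark_side_export(partition_rows):
    """Serialize a partition's rows and place the bytes in shared memory.

    Returns the segment name and payload size (the metadata a consumer
    needs to attach), plus the handle so the producer can unlink later.
    """
    payload = pickle.dumps(list(partition_rows))
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[: len(payload)] = payload
    return shm.name, len(payload), shm


def mpi_side_import(name, size):
    """Stand-in for the native MPI-based library: attach and deserialize."""
    shm = shared_memory.SharedMemory(name=name)
    rows = pickle.loads(bytes(shm.buf[:size]))
    shm.close()
    return rows


# Round-trip a tiny "partition" through the shared segment.
rows_in = [(0, 1.5), (1, 2.5)]
name, size, handle = spark_side_export(rows_in)
rows_out = mpi_side_import(name, size)
handle.close()
handle.unlink()
```

The design point the citation makes follows from this shape: the payload crosses the process boundary through the OS shared-memory segment rather than a socket, which avoids an extra copy but ties both sides to one machine's memory and adds serialization overhead on each handoff.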