On the Performance of Spark on HPC Systems: Towards a Complete Picture

Yildiz, Orçun; Ibrahim, Shadi

doi:10.1007/978-3-319-69953-0_5

Cited by 2 publications

(1 citation statement)

References 20 publications


“…Nonetheless, Big Data and HPC frameworks today remain largely incompatible: programming models and software development tools are inconsistent [5]; trying to mix both models out-of-the-box generates memory overheads and poor scalability in a HPC environment [6]; the disparity between collocated and distributed storage architectures in Big Data and HPC systems, respectively, degrades performance when running Big Data applications on HPC systems [7]; and the usage of merged Big Data models presents limitations, such as high memory consumption and low efficiency in communication between cooperating processes [8].…”

Section: Introduction (mentioning)

confidence: 99%

Spark-DIY: A Framework for Interoperable Spark Operations with High Performance Block-Based Data Models

Caíno‐Lores; Carretero; Nicolae; et al. 2018

2018 IEEE/ACM 5th International Conference on Big Data Computing Applications and Technologies (BDCAT)

Self Cite


Today's scientific applications are increasingly relying on a variety of data sources, storage facilities, and computing infrastructures, and there is a growing demand for data analysis and visualization for these applications. In this context, exploiting Big Data frameworks for scientific computing is an opportunity to incorporate high-level libraries, platforms, and algorithms for machine learning, graph processing, and streaming; inherit their data awareness and fault-tolerance; and increase productivity. Nevertheless, limitations exist when Big Data platforms are integrated with an HPC environment, namely poor scalability, severe memory overhead, and huge development effort. This paper focuses on a popular Big Data framework, Apache Spark, and proposes an architecture to support the integration of highly scalable MPI block-based data models and communication patterns with a map-reduce-based programming model. The resulting platform preserves the data abstraction and programming interface of Spark, without conducting any changes in the framework, but allows the user to delegate operations to the MPI layer. The evaluation of our prototype shows that our approach integrates Spark and MPI efficiently at scale, so end users can take advantage of the productivity facilitated by the rich ecosystem of high-level Big Data tools and libraries based on Spark, without compromising efficiency and scalability.
