2017
DOI: 10.14778/3090163.3090168
Bridging the gap between HPC and big data frameworks

Abstract: Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower than native implementations written with high performance computing tools such as MPI. There is a need to bridge the performance gap while retaining the benefits of the Spark ecosystem such as availability, productivity, and fault tolerance. In this paper, we propo…

Cited by 43 publications
(29 citation statements)
References 22 publications
“…[28] implemented 3 matrix kernels on Spark, and comparisons with C+MPI implementations showed a performance gap of 10x–40x without I/O. [29] proposed a system for integrating MPI with Spark and achieved 3.1–17.7x speedups on four graph and machine learning applications.…”
Section: Discussion
confidence: 99%
“…Big data programming models can be enhanced by integrating them with parallel programming models such as MPI. This approach can be seen in [4], which showed how to enable the Spark environment to use MPI libraries. Although this technique yields remarkable speedups, it must use shared memory, and it introduces other overheads as a potential drawback.…”
Section: Related Work
confidence: 99%
“…Unfortunately, there is usually a performance issue when running big data applications on HPC clusters because such applications are written in high-level programming languages. Such languages may be lacking in performance and may not encourage or support writing highly parallel programs, in contrast to parallel programming models like the Message Passing Interface (MPI) [4]. Furthermore, these platforms are designed with a distributed architecture, which differs from the architecture of HPC clusters [5].…”
Section: Introduction
confidence: 99%
“…Several recent projects have attempted to interface Spark with MPI-based codes. One of these is Spark+MPI [1], which also invokes existing MPI-based libraries. The approach used by this project serializes the data and transfers it from Spark to an existing MPI-based library using shared memory.…”
Section: Related Work
confidence: 99%
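The passage above describes the Spark+MPI handoff at a high level: the Spark side serializes partition data into a shared-memory segment, and a native MPI-based library attaches to the same segment and deserializes it without copying the data back through the JVM. The following is a minimal sketch of that idea only, not the paper's actual implementation; the function names are hypothetical, plain `pickle` stands in for the paper's serialization format, and an in-process reader stands in for the external MPI library.

```python
# Hypothetical sketch of a shared-memory handoff in the style the
# citation describes. All names here are illustrative, not Spark+MPI's API.
import pickle
from multiprocessing import shared_memory


def spark_side_export(partition_rows):
    """Serialize a partition's rows and place the bytes in shared memory.

    Returns the segment name and payload size (the metadata a consumer
    needs to attach), plus the handle so the producer can unlink later.
    """
    payload = pickle.dumps(list(partition_rows))
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[: len(payload)] = payload
    return shm.name, len(payload), shm


def mpi_side_import(name, size):
    """Stand-in for the native MPI-based library: attach and deserialize."""
    shm = shared_memory.SharedMemory(name=name)
    rows = pickle.loads(bytes(shm.buf[:size]))
    shm.close()
    return rows


# Round-trip a tiny "partition" through the shared segment.
rows_in = [(0, 1.5), (1, 2.5)]
name, size, handle = spark_side_export(rows_in)
rows_out = mpi_side_import(name, size)
handle.close()
handle.unlink()
```

The design point the citation makes follows from this shape: the payload crosses the process boundary through the OS shared-memory segment rather than a socket, which avoids an extra copy but ties both sides to one machine's memory and adds serialization overhead on each handoff.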