DataMPI: Extending MPI to Hadoop-Like Big Data Computing

Lu, Xiaoyi; Fan, Lei; Wang, Bing; Zha, Li; Xu, Zhiwei

doi:10.1109/ipdps.2014.90

Cited by 49 publications

(32 citation statements)

References 10 publications

“…[25] analysed more applications and highlighted areas where HPC and Apache Big Data Stack have good opportunities for integration on the base of [24]. DataMPI [26] tried to extend MPI to support Hadoop-like Big Data Computing jobs. It showed performance and flexibility benefits while maintaining high productivity, scalability, and fault tolerance of Hadoop.…”

Section: Discussion (mentioning)

confidence: 99%

Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive

Lin, Meng, et al. 2017. Int J Parallel Prog

Abstract. Metagenomics, the study of all microbial species cohabiting an environment, often produces large amounts of sequence data, ranging from several GBs to a few TBs. Analysing metagenomics data involves several steps, some of which are data-intensive and some compute-intensive. Typical bioinformatics pipelines attempt to analyse the entire data set on computer servers with several terabytes of RAM, which is very inefficient. To overcome this limit, we propose a MapReduce-based solution that partitions the data by species of origin. We implemented the solution using BioPig, an analytic toolkit for large-scale genomic sequence data based on Apache Hadoop and Pig. We simplified data types and logic design, compressed k-mer storage, and combined Hadoop with MPI to improve computational performance. After these optimizations, we achieved up to 193x speedup for the rate-limiting step and 8x speedup for the entire pipeline. The optimized software can also process datasets that are 16 times larger on the same hardware platform. Results from this case study suggest that combining Hadoop with MPI has great potential in large genomics applications that are both data-intensive and compute-intensive.
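The abstract mentions compressed k-mer storage without saying how the compression works. One common approach, shown here only as an illustrative assumption and not as BioPig's actual scheme, packs each DNA base into 2 bits so that a k-mer fits in a single integer. The function names `encode_kmer` and `decode_kmer` are hypothetical.

```python
# Minimal sketch of 2-bit k-mer packing (an assumed technique; the
# abstract does not state which compression BioPig uses).
_BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
_BITS_TO_BASE = "ACGT"


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer over the alphabet ACGT into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | _BASE_TO_BITS[base]
    return value


def decode_kmer(value: int, k: int) -> str:
    """Unpack an integer produced by encode_kmer back into a k-mer of length k."""
    bases = []
    for _ in range(k):
        bases.append(_BITS_TO_BASE[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(bases))
```

With this layout a 31-mer needs 62 bits instead of 31 bytes as text. The length `k` must be stored or known separately, because leading `A` bases encode as zero bits.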

“…DataMPI [11][12] is a key-value based communication library which extends MPI for big data applications. The design of DataMPI is based on the bipartite model, which defines the communication behavior between two built-in communicators, COMM_BIPARTITE_O and COMM_BIPARTITE_A.…”

Section: Overview of DataMPI (mentioning)

confidence: 99%
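The bipartite model in the statement above can be pictured as two groups of tasks, an O side that emits key-value pairs and an A side that receives them grouped by key. The following is a single-process sketch of that data flow under stated assumptions: it does not use MPI or the DataMPI API, and the names `partition`, `o_task`, `bipartite_exchange`, `a_task`, and `word_count` are hypothetical.

```python
# Single-process illustration of key-value movement from O tasks to A tasks.
# Not the DataMPI API; all function names here are made up for the sketch.
from collections import defaultdict
from zlib import crc32


def partition(key: str, num_a_tasks: int) -> int:
    """Choose the A task for a key with a deterministic hash."""
    return crc32(key.encode("utf-8")) % num_a_tasks


def o_task(lines):
    """O side: emit one (word, 1) pair per word, as in word count."""
    for line in lines:
        for word in line.split():
            yield word, 1


def bipartite_exchange(o_outputs, num_a_tasks):
    """Route every pair from every O task to the inbox of one A task."""
    inboxes = [defaultdict(list) for _ in range(num_a_tasks)]
    for pairs in o_outputs:
        for key, value in pairs:
            inboxes[partition(key, num_a_tasks)][key].append(value)
    return inboxes


def a_task(inbox):
    """A side: visit keys in sorted order and sum their values."""
    return {key: sum(values) for key, values in sorted(inbox.items())}


def word_count(splits, num_a_tasks=2):
    """Run one O task per input split, exchange pairs, then merge A results."""
    o_outputs = [o_task(split) for split in splits]
    merged = {}
    for inbox in bipartite_exchange(o_outputs, num_a_tasks):
        merged.update(a_task(inbox))
    return merged
```

Calling `word_count([["a b a"], ["b c"]])` runs two O tasks and two A tasks. Because the partition function is deterministic, all values for a given key arrive at the same A task, which is the property a key-value communication layer has to guarantee.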

“…The trend of converging big data and high performance computing (HPC) is emerging [6][7][8][9][10]. As a specific example of this trend, DataMPI [11][12] is proposed, which aims at extending MPI by a key-value pair based communication operations to provide high performance communication in large-scale data computing scenario. Considering different data structures, communication styles, and optimization methodologies in data computing, multiple programming paradigms are supported in DataMPI.…”

Section: Introduction (mentioning)

confidence: 99%

Accelerating Iterative Big Data Computing Through MPI

Fan, 2015. J. Comput. Sci. Technol. (Self Cite)

The currently popular systems Hadoop and Spark cannot achieve satisfactory performance when running iterative big data applications because they overlap computation and communication inefficiently. The pipeline of computing, data movement, and data management plays a key role in current distributed data computing systems. In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running the PageRank workload, and then propose an event-driven pipeline and in-memory shuffle design with better overlapping of computation and communication, implemented as DataMPI-Iteration, an MPI-based library for iterative big data computing. Our performance evaluation shows DataMPI-Iteration can achieve 9X∼21X speedup over Apache Hadoop and 2X∼3X speedup over Apache Spark for PageRank and K-means.

“…A comparison of architecture and abstractions between HPC and Apache Big Data Stacks (ABDS) is presented in [1] and the authors argued that a convergence between the two at many levels can be observed. While regular Hadoop uses the Java-based Netty package for distributed communication, several works have proposed to use Message Passing Interface (MPI) libraries, which are typically C/C++ based, to achieve better performance, especially on HPC clusters with high-speed networks [9][2]. A comprehensive assessment on the performance impact of high-speed interconnects (including 10Gbps Ethernet and Infiniband) on MapReduce is presented in [3].…”

Section: Background and Motivation (mentioning)

confidence: 99%

“…While Big Data software packages, such as Hadoop, were initially developed for inexpensive commodity workstations, as multi-core machines equipped with large memory capacities and hardware accelerators are becoming increasingly affordable, new Big Data systems that can take advantages of new hardware features and deliver high performance, such as Apache Spark and Cloudera Impala for in-memory and in-network processing, are becoming more preferable. As a result, there are growing interests on using High Performance Computing (HPC) facilities that are typically equipped with powerful processors (including accelerators) and high speed networks for Big Data applications [1][2][3][4]. Unfortunately, accesses to HPC facilities are very often restrictive and it is very difficult (if not impossible) to reconfigure HPC platforms for research purposes.…”

Section: Introduction (mentioning)

confidence: 99%

Tiny GPU Cluster for Big Spatial Data: A Preliminary Performance Evaluation

Zhang, You, Gruenwald, 2015. 2015 IEEE 35th International Conference on Distributed Computing Systems Workshops

GPU-equipped computing nodes have much higher ratios between floating point computing power (on the order of TFlops and growing fast) and network bandwidth (on the order of Gbps and remaining stable) than the regular computing nodes that Hadoop-based systems target. The gap makes efficient and scalable processing of large-scale data challenging, especially for geo-referenced spatial (or geospatial) data, whose processing is both data-intensive and compute-intensive. We aim at developing a tiny GPU cluster using Nvidia Tegra K1 (TK1) System on Chip (SoC) boards as a downscaled, low-cost GPU cluster for Big (Spatial) Data research. The tiny GPU cluster is equipped with a standard Gigabit Ethernet network, has much less computing power and a much smaller energy footprint than a regular GPU cluster, and represents a new platform with a more balanced compute-to-communication ratio. We have ported to the tiny GPU cluster our implementations of both single-node technologies for spatial joins based on point-in-polygon tests and the lightweight distributed execution engine originally developed for regular clusters. We evaluate its performance on two real-world geospatial applications with various settings, and the experimental results demonstrate good scalability. Preliminary analysis of the scaling effect between the tiny cluster and a regular Amazon EC2 cluster using a simplified model suggests that the ARM-based CPU of the TK1 board is likely to achieve better energy efficiency, while the Nvidia GPU of the TK1 board might be less efficient than desktop/server-grade GPUs, in both standalone and 4-node cluster settings.
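For readers unfamiliar with the point-in-polygon test behind the spatial joins mentioned in this abstract, the following is a plain CPU sketch using the standard ray-casting method. It is an illustrative assumption only: the abstract does not describe the authors' GPU implementation, and the names `point_in_polygon` and `spatial_join` are hypothetical.

```python
# Ray-casting point-in-polygon test and a naive nested-loop spatial join.
# A CPU illustration only; not the GPU code evaluated in the cited paper.
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) lies strictly inside the polygon.

    polygon is a list of (x, y) vertices in order. A horizontal ray is cast
    to the right of the point; an odd number of edge crossings means inside.
    Points exactly on an edge are not handled specially.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge crosses the ray's height only if its endpoints straddle y.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def spatial_join(points, polygons):
    """Pair each point index with the name of every polygon containing it."""
    return [
        (index, name)
        for index, (px, py) in enumerate(points)
        for name, polygon in polygons.items()
        if point_in_polygon(px, py, polygon)
    ]
```

The nested loop costs one test per point-polygon pair, which is why practical systems add spatial indexing or parallel hardware. Each test is independent of the others, which makes the workload a natural fit for data-parallel execution.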
