HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack

Fox, Geoffrey; Qiu, Judy; Kamburugamuve, Supun; Jha, Shantenu; Luckow, André

doi:10.1109/ccgrid.2015.122

Cited by 28 publications

(19 citation statements)

References 4 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…[24] proposed a preliminary implementation of the high-performance big data system (HPBDS) called High-Performance Computing-Big Data Stack (HPC-ABDS). [25] analysed more applications and highlighted areas where HPC and Apache Big Data Stack have good opportunities for integration on the base of [24]. DataMPI [26] tried to extend MPI to support Hadoop-like Big Data Computing jobs.…”

Section: Discussionmentioning

confidence: 99%

Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive

Lin

Meng

et al. 2017

Int J Parallel Prog

View full text Add to dashboard Cite

Abstract. Metagenomics, the study of all microbial species cohabitants in an environment, often produces large amount of sequence data varying from several GBs to a few TBs. Analysing metagenomics data involving several steps, some steps are data intensive, and some are compute intensive. Typical bioinformatics pipelines attempt to analyse the entire data set on computer servers with several terabytes of RAM, which is very inefficient. To overcome this limit, here we propose a MapReduce based solution to partition the data based on their species of origin. We implemented the solution using BioPig, an analytic toolkit for large-scale genomic sequence data based on Apache Hadoop and Pig. We simplified data types and logic design, compressed k-mer storage and combined Hadoop with MPI to improve the computational performance. After these optimizations, we achieved up to 193x speedup for the rate-limiting step and 8x speedup for the entire pipeline, respectively. The optimized software is also capable to process datasets that are 16 times larger on the same hardware platform. Results from this case study suggest the combined Hadoop with MPI approach has great potential in large genomics applications that are both data-intensive and compute-intensive.

show abstract

Section: Discussionmentioning

confidence: 99%

Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive

Lin

Meng

et al. 2017

Int J Parallel Prog

View full text Add to dashboard Cite

show abstract

“…There are synergies between HPC and big data systems, and authors 29,30 among others 31 have expressed the need to enhance these systems by taking ideas from each other. In previous work 32,33 we have identified the general implications of threads and processes, cache, memory management in NUMA 34 , as well as multi-core settings for machine learning algorithms with MPI.…”

Section: Related Workmentioning

confidence: 99%

Twister2: Design of a big data toolkit

Kamburugamuve

Govindarajan

Wickramasinghe

et al. 2019

Concurrency and Computation

Self Cite

View full text Add to dashboard Cite

Data-driven applications are essential to handle the ever-increasing volume, velocity, and veracity of data generated by sources such as the Web and Internet of Things (IoT) devices. Simultaneously, an event-driven computational paradigm is emerging as the core of modern systems designed for database queries, data analytics, and on-demand applications. Modern big data processing runtimes and asynchronous many task (AMT) systems from high performance computing (HPC) community have adopted dataflow event-driven model. The services are increasingly moving to an event-driven model in the form of Function as a Service (FaaS) to compose services. An event-driven runtime designed for data processing consists of well-understood components such as communication, scheduling, and fault tolerance. Different design choices adopted by these components determine the type of applications a system can support efficiently. We find that modern systems are limited to specific sets of applications because they have been designed with fixed choices that cannot be changed easily. In this paper, we present a loosely coupled component-based design of a big data toolkit where each component can have different implementations to support various applications. Such a polymorphic design would allow services and data analytics to be integrated seamlessly and expand from edge to cloud to HPC environments.

show abstract

“…US President Obama's recent executive order (Obama, 2015) sets a new agenda for the creation of exascale computing facilities (capable of 10 15 mathematical operations per second) while calling for joint development of exascale and big data/cloud computing facilities. Fox and collaborators (Fox, Qui, Kamburugamuve, Jha, & Luckow, 2015) depict many of the commonalities between cloud computing and HPC and propose an alignment and set of commonalities between HPC and big data stacks that can form a foundation for the sort of joint development of both approaches called for by President Obama in his recent executive order.…”

Section: Economics Of Computingmentioning

confidence: 99%

Cyberinfrastructure, Cloud Computing, Science Gateways, Visualization, and Cyberinfrastructure Ease of Use

Stewart

Knepper

Link

et al. 2019

Advances in Computer and Electrical Engineering

View full text Add to dashboard Cite

Computers accelerate our ability to achieve scientific breakthroughs. As technology evolves and new research needs come to light, the role for cyberinfrastructure as “knowledge” infrastructure continues to expand. In essence, cyberinfrastructure can be thought of as the integration of supercomputers, data resources, visualization, and people that extends the impact and utility of information technology. This chapter discusses cyberinfrastructure, the related topics of science gateways and campus bridging, and identifies future challenges and opportunities in cyberinfrastructure.

show abstract

HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack

Cited by 28 publications

References 4 publications

Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive

Combining Hadoop with MPI to Solve Metagenomics Problems that are both Data- and Compute-intensive

Twister2: Design of a big data toolkit

Cyberinfrastructure, Cloud Computing, Science Gateways, Visualization, and Cyberinfrastructure Ease of Use

Contact Info

Product

Resources

About