HadoopCL: MapReduce on Distributed Heterogeneous Platforms through Seamless Integration of Hadoop and OpenCL

Abstract: DNA sequencing has been a difficult field since it was first worked on, and the volume of data it produces is growing exponentially. The amount of data involved in DNA sequence searching is so large that conventional tools and algorithms cannot handle it. BLAST is a tool provided by the National Center for Biotechnology Information (NCBI) that compares nucleotide or protein sequences against sequence databases and calculates the statistical significance of matches. Variants of BLAST such as blastn, blastp, and blastx are used to search nucleotide, protein, and nucleotide-to-protein sequences, respectively. GPU-BLAST and HBLAST have already been proposed to handle the vast amount of data involved in DNA sequence searching and to speed up the search. In this article, we propose a new model for searching DNA sequences, HCudaBLAST, which combines CUDA processing with Hadoop for efficient searching. The results recorded after implementing HCudaBLAST are shown. The solution combines the multi-core parallelism of GPGPUs with the scalability provided by the Hadoop framework.

Introduction: Searching DNA or protein sequences is the most common operation in the analysis of any new sequence, and the reason is simple: it finds similar regions between two or more nucleotide or protein sequences. Such similarity can be used to determine how closely two or more species are related, to identify a completely new species, to locate domains within a sequence of interest, and so on. However, finding similar regions between two or more sequences is hard because of the size of the sequences involved. To overcome this difficulty, various tools and algorithms have been proposed. Let us look at some of these in the following paragraphs. In computational biology and bioinformatics, aligning sequences to determine their similarity is an essential and widely used computational procedure. A wide range of algorithms has been applied to the sequence alignment problem. They include the Smith-Waterman algorithm [1], which is based on dynamic programming and is accurate but slow, and the basic local alignment search tool (BLAST) [2] and FASTA [3], which are based on heuristic or probabilistic methods and are faster but less accurate. The very first algorithm was given by Smith and Waterman in 1981 in the form of the Smith-Waterman algorithm. It is a local sequence alignment algorithm with high time complexity that nonetheless gives optimal results. To reduce the running time of the Smith-Waterman algorithm, Lipman and Pearson proposed the FASTA tool in 1985, which takes a given nucleotide or amino acid sequence and searches a corresponding sequence database using local sequence alignment. It is based on a heuristic method, which contributes to its high speed...

“…-Java Aparapi [12]: it converts Java byte code to OpenCL at runtime. It also works with Hadoop and it supports different kind of GPUs.…”

Section: Proposed Work (mentioning)

confidence: 99%

HCudaBLAST: an implementation of BLAST on Hadoop and Cuda

Khare

Khan

2017

J Big Data

“…Another problem with Hadoop is it consumes high energy. Considering these problems in mind, a recommender system is implemented on Hadoop CL 13 . Hadoop CL uses Open CL to utilize the resources like cores of CPUs, GPUs, APUs, FPGAs, etc.…”

Section: Proposed Work (mentioning)

confidence: 99%

A Survey on Accelerated Mapreduce for Hadoop

2017

Big data is defined by the 3Vs: variety, volume, and velocity. The volume of data is huge, the data exists in a variety of file types, and it grows very rapidly. Storing and processing big data has always been difficult, and it has become even more challenging in recent years. High-performance techniques have been introduced to handle it, along with frameworks such as Apache Hadoop. Apache Hadoop provides map/reduce to process big data, but map/reduce can be accelerated further. This paper surveys approaches to map/reduce acceleration and to fast, energy-efficient computation.

“…HadoopCL [5], an integration of Hadoop and OpenCL. The purpose of HadoopCL is to enable the use of heterogeneous process in distributed system.…”

Section: HadoopCL (mentioning)

confidence: 99%

GPU based Suffix Array Pattern Matching Approach for Big Data

Katoch

Silakari

Chourasia

2017

IJCA

Big data has become a growing problem in recent years. To address it, Hadoop has evolved into one of the most widely used tools and has been adopted by large companies such as Facebook and Yahoo. Searching for a large number of patterns in big data is a challenging task. Map/Reduce is used to write code that performs pattern matching on big data. In this work, OpenCL is combined with Apache Hadoop to write fast Map/Reduce code for pattern matching using suffix arrays.

Cited by 39 publications

References 4 publications
