Abstract

DNA sequencing has been a difficult field since it was first worked upon, and the volume of data it produces is growing at an exponential rate. The amount of data involved in DNA searching is so large that conventional tools and algorithms are not suited to this degree of data processing. BLAST is a tool provided by the National Center for Biotechnology Information (NCBI) that compares nucleotide or protein sequences against sequence databases and calculates the statistical significance of matches. Many variants of BLAST, such as blastn, blastp, and blastx, are used for nucleotide, protein, and translated nucleotide-to-protein searches, respectively. GPU-BLAST and HBLAST have already been proposed to handle the vast amount of data involved in searching DNA sequences and to speed up the search. In this article, we propose a new model for searching DNA sequences: HCudaBLAST. It combines CUDA processing with Hadoop for efficient searching, bringing together the multi-core parallelism of GPGPUs and the scalability of the Hadoop framework. The results recorded after implementing HCudaBLAST are shown.

Khare et al. J Big Data (2017) 4:41

Introduction

Searching DNA or protein sequence databases is usually the first operation in the analysis of any new sequence. The reason is simple: it finds similar regions between two or more nucleotide or protein sequences. This similarity can be used to determine many things, including the relatedness of two or more species, the identification of a completely new species, and the location of domains within the sequence of interest. However, finding similar regions between two or more sequences is difficult because of the size of the sequence data involved. To overcome this difficulty, various tools and algorithms have been proposed. Let us look at some of these in the following paragraphs.

In computational biology and bioinformatics, aligning sequences to determine the similarity between them is an essential and widely used computational procedure. A wide range of algorithms has been applied to the sequence alignment challenge. These include the Smith-Waterman algorithm [1], which is based on dynamic programming and is accurate but quite slow, and the basic local alignment search tool (BLAST) [2] and FASTA [3], which are based on heuristic or probabilistic methods and are faster but less accurate.

One of the earliest algorithms was given by Smith and Waterman in 1981 in the form of the Smith-Waterman algorithm. It is a local sequence alignment algorithm that has high time complexity but gives optimal results. To overcome the time consumption of the Smith-Waterman algorithm, Lipman and Pearson proposed the FASTA tool in 1985, which takes a given nucleotide or amino acid sequence and searches a corresponding sequence database using local sequence alignment. It is based on a heuristic method, which contributes to its high speed...
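To make the cost of the dynamic-programming approach concrete, the following is a minimal Python sketch of Smith-Waterman local alignment. It is an illustration only, not the implementation used in this work: the function name `smith_waterman`, the scoring values (match = 2, mismatch = -1) and the linear gap penalty of -1 are assumptions chosen for readability. The nested loops fill a table of size (m + 1) x (n + 1), so the running time grows with the product of the two sequence lengths, which is what makes the exact method slow on large databases.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scoring values are illustrative assumptions; a linear gap penalty is used.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap inserted in b
            left = H[i][j - 1] + gap    # gap inserted in a
            # Clamping at 0 lets an alignment restart anywhere (local alignment).
            H[i][j] = max(0, diag, up, left)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, top, bottom = smith_waterman("TTACGTTT", "GGACGGG")
    print(score, top, bottom)  # 6 ACG ACG
```

Heuristic tools such as FASTA and BLAST avoid filling this whole table for every database sequence, which is where their speed advantage comes from, at the price of possibly missing the optimal alignment.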