Structural alignments of ribonucleic acid (RNA) sequences solved by the Sankoff algorithm are computationally expensive and often require constraints to be usable in practice. Modern Graphics Processing Units (GPUs) contain more than 1000 cores, which compute in parallel to speed up applications. Here, we present a GPU-based solution to the RNA structural alignment problem that makes use of precalculated base pair probabilities on the individual sequences.

We designed and developed an unconstrained version of the Sankoff algorithm, which obtains the optimal result and calculates the entire four-dimensional dynamic programming (4D DP) matrix.

Our approach uses a two-level wavefront strategy to exploit parallelism. The 4D DP matrix is divided into one external matrix (EM) and several internal matrices (IMs), and wavefront strategies are applied to the EM and the IMs in a two-level hierarchical way. At the first level, the wavefront is applied to the EM, calculating the cells that belong to the same diagonal in parallel. At the second level, since each cell in the EM is itself an IM, the cells that belong to the same IM diagonal are calculated in parallel. The results obtained with real RNA sequences show that our GPU version outperforms a multicore CPU implementation of the unconstrained Sankoff algorithm. Compared with the CPU-based version running on 32 cores, our approach achieves a speedup of 7.81x on the NVIDIA Tesla P100. In this case, the execution time was reduced from 6 hours and 18 minutes (32 cores) to 48 minutes and 20 seconds (GPU).
KEYWORDS
base-pairing probabilities, GPUs, high-performance computing, RNA, Sankoff algorithm
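The two-level wavefront described in the abstract can be illustrated with a small scheduling sketch. This is not the GPU implementation presented in this work: the function names `anti_diagonals` and `two_level_schedule` are hypothetical, all IMs are assumed to have the same shape, every cell is assumed to depend only on cells of earlier anti-diagonals, and the IMs of one EM diagonal are advanced in lockstep purely to keep the example short. Under these assumptions, each step lists the (EM cell, IM cell) pairs that have no dependencies on one another and could therefore be computed in parallel.

```python
from typing import Iterator, List, Tuple

Cell = Tuple[int, int]


def anti_diagonals(rows: int, cols: int) -> Iterator[List[Cell]]:
    """Yield the cells of a rows x cols grid grouped by anti-diagonal (r + c == d)."""
    for d in range(rows + cols - 1):
        r_lo = max(0, d - cols + 1)   # first row that still has a valid column
        r_hi = min(rows - 1, d)       # last row on this anti-diagonal
        yield [(r, d - r) for r in range(r_lo, r_hi + 1)]


def two_level_schedule(
    em_shape: Tuple[int, int], im_shape: Tuple[int, int]
) -> List[List[Tuple[Cell, Cell]]]:
    """Return the parallel steps of a two-level wavefront (illustrative only).

    Level 1 walks the anti-diagonals of the external matrix (EM).
    Level 2 walks the anti-diagonals of the internal matrix (IM) held by
    each EM cell. Assumption: all IMs share the shape `im_shape`.
    """
    steps = []
    for em_diag in anti_diagonals(*em_shape):        # level 1: EM wavefront
        for im_diag in anti_diagonals(*im_shape):    # level 2: IM wavefront
            # Every pair in this step is independent of the others in the step.
            steps.append([(em, im) for em in em_diag for im in im_diag])
    return steps


if __name__ == "__main__":
    schedule = two_level_schedule((2, 2), (2, 2))
    for number, step in enumerate(schedule):
        print(f"step {number}: {len(step)} independent cell(s)")
```

The step sizes grow and then shrink as the wavefront crosses each matrix, so the available parallelism is lowest at the corners of the EM and of each IM and highest near their main anti-diagonals.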
INTRODUCTION
Bioinformatics is an interdisciplinary area that involves computer science, statistics, mathematics, and biology, and aims to provide algorithms and tools that help biologists analyze their data. Sequence comparison is one of the most basic operations in Bioinformatics; its goal is to expose the evolutionary relationship among two or more sequences by computing a similarity score and an alignment. A comparison operation can take into account the biological sequences themselves (1D comparison) or their structure (2D and 3D comparison).

Structural comparisons are usually done for ribonucleic acid (RNA) sequences, which fold over themselves, generating a structure.

Nussinov et al. [1] proposed an algorithm based on dynamic programming to compute the 2D structure of a given RNA sequence. This algorithm was extended to full energy calculation in several versions (e.g., [2]) and, by Sankoff [3], to a full structural alignment of more than one sequence, which folds and aligns the sequences simultaneously and outputs the 2D consensus structure of the set. Sankoff's algorithm runs in $O(L^{3N})$ time and uses $O(L^{2N})$ memory for $N$ sequences of length $L$, and provides the optimal solution.

Due to its high computational cost and huge memory requirements, Sankoff's algorithm is considered infeasible. For this reason, many practical Bioinformatics tool...