Dynamic Programming (DP) computes optimal solutions to a problem by combining optimal solutions to many overlapping subproblems. DP algorithms exploit this overlap to explore otherwise exponential-sized problem spaces in polynomial time, making them central to applications ranging from logistics to computational biology. In this paper, we present a general strategy for obtaining highly efficient parallel DP implementations using a recursive cache-oblivious divide-and-conquer technique that turns inflexible kernels into flexible ones (kernels that read from and write to disjoint sub-matrices). We solve four non-trivial DP problems widely used in bioinformatics, namely the parenthesis problem, Floyd-Warshall's all-pairs shortest paths, the gap problem, and protein accordion folding, using a recursive cache-oblivious technique that decomposes the original inflexible looping kernel into highly optimizable flexible kernels. To the best of our knowledge, no such recursive parallel DP algorithms were previously known for the protein folding and gap problems. The algorithms are hybrid in the same way most high-performance matrix multiplication algorithms are: recursive with iterative base cases. We show that the base cases of these recursive divide-and-conquer algorithms are predominantly matrix-multiplication-like (MM-like) flexible kernels that expose many optimization opportunities not offered by traditional looping DP codes. Moreover, the most costly/dominating kernels for these problems are often flexible. As a result, one can obtain highly efficient DP implementations by simply optimizing these kernels. We present a few generic optimization steps that suffice to optimize these DP implementations. Our implementations achieve 5–100× speedup over their standard loop-based DP counterparts on modern multicore machines.
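To make the "inflexible looping kernel" concrete, here is a minimal sketch of the textbook loop-based Floyd-Warshall algorithm, one of the four problems listed above. This is an illustrative reconstruction, not the paper's code: every iteration of the outer k-loop both reads and writes the same DP table, which is the read-write coupling that prevents reordering or tiling the loops.

```python
INF = float("inf")

def floyd_warshall_loop(dist):
    """Textbook loop-based Floyd-Warshall all-pairs shortest paths.

    `dist` is an n x n list of lists; dist[i][j] is the edge weight
    from i to j (INF if absent, 0 on the diagonal). Updated in place.
    The k-loop reads and writes the same table, so the three loops
    cannot be freely reordered -- an "inflexible" kernel.
    """
    n = len(dist)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```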
We also present results on manycores (Xeon Phi) and clusters of multicores, obtained through simple extensions for SIMD and distributed-shared-memory architectures, respectively.

Dynamic programs are traditionally implemented using simple loop-based algorithms that are straightforward to implement, have good spatial locality, and benefit from hardware prefetchers. However, looping codes suffer in performance from poor temporal cache locality. Low temporal locality increases pressure on memory bandwidth, and that pressure grows with the number of active cores. Hence, there is significant room for improvement in the cache usage of these algorithms, and consequently also in running times, especially on parallel machines. Iterative DP implementations are often inflexible in the sense that the loops and the data in the DP table cannot be suitably reordered to optimize for better spatial locality, parallelization, and/or vectorization. Such inflexibility arises from the fact that the codes often read from and write to the same DP table, and thus impose a strict read-write ordering of the cells. Recursive divide-and-conquer DP algorithms can often overcome many limitations of their iterative counterparts.
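The recursive reformulation described above can be sketched for Floyd-Warshall. The following is an illustrative reconstruction (not the paper's implementation) of the standard recursive divide-and-conquer APSP scheme: the table is split into quadrants, the diagonal quadrants are solved recursively, and all remaining work lands in an MM-like min-plus kernel `mac` whose inputs and output are disjoint (or safely separated) sub-matrices. That kernel is the "flexible" part that can be tiled, vectorized, or parallelized independently. The scheme assumes a zero diagonal and no negative cycles.

```python
import numpy as np

def mac(D, B, C):
    """MM-like flexible kernel: D = min(D, min-plus product of B and C).

    Reads B and C, writes D; the full B+C tensor is materialized
    before D is updated, so D may alias B or C safely.
    """
    np.minimum(D, (B[:, :, None] + C[None, :, :]).min(axis=1), out=D)

def rfw(A, base=2):
    """Recursive divide-and-conquer Floyd-Warshall on square matrix A.

    A is an n x n NumPy array of distances with a zero diagonal,
    updated in place. Below `base`, falls back to the iterative kernel.
    """
    n = A.shape[0]
    if n <= base:
        for k in range(n):
            np.minimum(A, A[:, k:k + 1] + A[k:k + 1, :], out=A)
        return
    h = n // 2
    A11, A12 = A[:h, :h], A[:h, h:]
    A21, A22 = A[h:, :h], A[h:, h:]
    rfw(A11, base)                 # close the top-left quadrant
    mac(A12, A11, A12)             # propagate through A11 (flexible kernels)
    mac(A21, A21, A11)
    mac(A22, A21, A12)
    rfw(A22, base)                 # close the bottom-right quadrant
    mac(A12, A12, A22)             # propagate back through A22
    mac(A21, A22, A21)
    mac(A11, A12, A21)
```

Note that six of the eight steps are calls to the same `mac` kernel, which is why optimizing that one MM-like routine dominates overall performance.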