Modeling biology as classical problems in computer science allows researchers to leverage the wealth of theoretical advancements in this field. Despite countless studies presenting heuristics that report improvement on specific benchmarking data, there has been comparatively little focus on exploring the theoretical bounds on the performance of practical (polynomial-time) algorithms. Conversely, theoretical studies tend to overstate the generalizability of their conclusions to physical biological processes. In this article we provide a fresh perspective on the concepts of NP-hardness and inapproximability in the computational biology domain, using popular sequence assembly and alignment (mapping) algorithms as illustrative examples. These algorithms exemplify how computer science theory can both (a) lead to substantial improvement in practical performance and (b) highlight areas ripe for future innovation. Importantly, we discuss caveats that seemingly allow the performance of heuristics to exceed their provable bounds.
Keywords: algorithms, inapproximability, genomics, alignment.
SEQUENCE ASSEMBLY: WHERE THEORY MEETS PRACTICE
Given a set of $n$ strings, $S = \{s_1, s_2, \ldots, s_n\}$, the goal of the shortest common superstring problem (SCSP) is to find the minimum-length string, $s$, such that each $s_i \in S$ is a substring of $s$. The SCSP over the nucleotide alphabet, $\Sigma = \{A, C, G, T\}$, thus provides a simple and convenient model for the sequence assembly problem, whereby we wish to determine the DNA sequence from which a set of reads (or k-mers) are derived. This is a classic example of how decades of research on approximation bounds of NP-hard problems can be applied to improve the practical performance of algorithms in the computational biology domain.
A detailed review on the development of approximation and hardness bounds for SCSP is provided by Golovnev et al. (2013). Despite these advancements, the power of theoretical computer science abstractions is limited by how closely they represent the true biological problem (as we discuss later), which in this case is reversing the DNA fragmentation process inherent to high-throughput sequencing experiments. SCSP and its sequence assembly derivatives (Sweedyk, 2000; Kaplan and Shafrir, 2005) have thus been criticized for their assumptions regarding parsimony and tandem repeats (Nagarajan and Pop, 2009), motivating the application of graph theoretic models that make more appropriate sets of assumptions.
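To make the SCSP definition concrete, the following is a minimal sketch of the classic greedy merging heuristic, which repeatedly joins the two fragments with the largest suffix-prefix overlap. It is offered only as an illustration of a polynomial-time approach to SCSP, not as the method of any assembler discussed here. The function names `overlap` and `greedy_superstring` and the toy reads are assumptions introduced for this example, and the output is a valid common superstring but is not guaranteed to be the shortest one.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads: list[str]) -> str:
    """Greedy heuristic for SCSP (hypothetical helper, not an exact solver)."""
    # Drop duplicates and reads fully contained in another read; such reads
    # are automatically substrings of any superstring of the remaining reads.
    unique = list(dict.fromkeys(reads))
    frags = [r for r in unique if not any(r != o and r in o for o in unique)]

    while len(frags) > 1:
        # Find the ordered pair (i, j) with the largest overlap; ties are
        # broken by the first pair encountered.
        best_k, best_i, best_j = -1, 0, 1
        for i, j in permutations(range(len(frags)), 2):
            k = overlap(frags[i], frags[j])
            if k > best_k:
                best_k, best_i, best_j = k, i, j

        # Merge the pair, writing the shared region only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)

    return frags[0] if frags else ""


reads = ["AGGT", "GGTC", "GTCA"]  # toy reads over the alphabet {A, C, G, T}
print(greedy_superstring(reads))  # AGGTCA
```

Each merge keeps both input fragments as substrings of the result, so every read remains a substring of the final string. The sketch inherits the modeling assumptions criticized above: it seeks the most parsimonious reconstruction and will collapse repeated regions whenever doing so shortens the output.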