Universal arrays contain all possible oligonucleotides of a certain length, typically 6-10 bases. They can determine in a single experiment all substrings of that length that occur along a target sequence. That information, also called the spectrum of the sequence, is not sufficient to uniquely reconstruct a sequence longer than a few hundred bases. We have devised a polynomial algorithm that reconstructs the sequence, given the spectrum and an additional reference sequence, homologous to the target sequence. Such a reference is available, for example, in the identification of single-nucleotide polymorphisms. The algorithm can handle errors in the spectrum as well as substitutions, insertions, and deletions in the target sequence. We present extensive simulation results, which show that the algorithm correctly reconstructs target sequences of >2,000 nucleotides from error-prone 8-mer spectra when realistic levels of single-nucleotide polymorphisms are present.

sequencing by hybridization | mutation detection | SNP genotyping | hidden Markov models | DNA microarrays
Sequencing by Hybridization

Sequencing by hybridization (SBH) was invented in the late 1980s as an alternative to gel-based sequencing (1-3). This method makes use of a universal DNA microarray, which harbors all oligonucleotides of length k (called k-words, or simply words when k is clear). These oligonucleotides are hybridized to an unknown DNA fragment, whose sequence we would like to determine. Under ideal conditions, this target molecule would hybridize to all words whose Watson-Crick complements occur somewhere along its sequence. Thus, in principle, one could determine in a single microarray reaction the set of all k-long substrings of the target and try to infer the sequence from those data. The technique was validated in arrays of 7-mers and 8-mers (4, 5), and up to 10-mers are possible with current array technology.

The fundamental computational problem in SBH is the reconstruction of a sequence from its spectrum, the set of all words occurring along the sequence. Pevzner (6) reduced that problem (assuming the number of occurrences of each word is known) to the polynomial task of finding an Eulerian path in a graph.

The main weakness of SBH is ambiguous solutions: When several sequences have the same spectrum, there is no way to determine the true sequence. Theoretical analysis and simulations (4, 7) have shown that even when the spectrum is errorless and contains the multiplicity of each word, the average length of a uniquely reconstructible sequence using an 8-mer array is <200 bases, far below a single read length on a commercial gel-lane machine.

Although an effective and competitive sequencing solution using SBH has yet to be demonstrated, this strategy continues to attract attention. In principle, SBH holds promise to considerably economize on the task of sequencing, one of the major efforts in modern biotechnology. Alternative array designs (8-10) as well as interactive protocols (11) were suggested.
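The spectrum, the Eulerian-path reduction, and the ambiguity problem described above can be illustrated with a minimal Python sketch. It assumes an errorless spectrum with known word multiplicities and k of at least 2. The function names spectrum and reconstruct are illustrative and do not come from the original work. Each k-word is treated as an edge from its (k-1)-long prefix to its (k-1)-long suffix, and one Eulerian path is extracted with Hierholzer's algorithm, which is a standard choice here and not necessarily the procedure used in ref. 6.

```python
from collections import Counter, defaultdict


def spectrum(seq, k):
    """Return the multiset of all k-long substrings (k-words) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def reconstruct(spec, k):
    """Return one sequence whose k-word multiset equals spec, or None.

    Assumes an errorless spectrum with multiplicities and k >= 2.
    """
    if not spec:
        return None
    graph = defaultdict(list)   # (k-1)-word -> list of successor (k-1)-words
    balance = Counter()         # out-degree minus in-degree per node
    for word, count in spec.items():
        for _ in range(count):
            graph[word[:-1]].append(word[1:])
            balance[word[:-1]] += 1
            balance[word[1:]] -= 1
    # An Eulerian path starts at the node with one surplus outgoing edge,
    # or at any node if the graph is balanced (Eulerian cycle).
    starts = [node for node, b in balance.items() if b == 1]
    start = starts[0] if starts else next(iter(graph))
    # Hierholzer's algorithm, iterative form.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    candidate = path[0] + "".join(node[-1] for node in path[1:])
    # Reject the result if no Eulerian path existed (e.g. disconnected graph).
    return candidate if spectrum(candidate, k) == spec else None


if __name__ == "__main__":
    # Two different sequences that share the same 3-mer spectrum.
    a, b = "ATGCGTGGCA", "ATGGCGTGCA"
    print(spectrum(a, 3) == spectrum(b, 3))
    print(reconstruct(spectrum(a, 3), 3))
```

In this toy case the two sequences have identical 3-mer spectra, so the reconstruction returns a sequence consistent with the spectrum but cannot tell which of the two was the true target. This is the ambiguity that limits unique reconstruction from the spectrum alone.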
Similar Sequences Are Ubiquitous

Similarity...