New DNA sequencing technologies can sequence up to one billion bases in a single day at low cost, putting large-scale sequencing within the reach of many scientists. Many researchers are forging ahead with projects to sequence a range of species using the new technologies. However, these new technologies produce read lengths as short as 35-40 nucleotides, posing challenges for genome assembly and annotation. Here we review the challenges and describe some of the bioinformatics systems that are being proposed to solve them. We specifically address issues arising from using these technologies in assembly projects, both de novo and for resequencing purposes, as well as efforts to improve genome annotation in the fragmented assemblies produced by short read lengths.
New technologies: more data and new types of data

The ongoing revolution in sequencing technology has led to the production of sequencing machines with dramatically lower costs and higher throughput than the technology of just 2 years ago. Sequencers from 454 Life Sciences/Roche, Solexa/Illumina and Applied Biosystems (SOLiD technology) are already in production, and a competing technology from Helicos should appear soon. However, the increase in the volume of raw sequence that can be produced from these sequencers is threatening to swamp our available data archives, because genomics centers are gearing up to produce much more data in the next several years. For example, major National Institutes of Health (NIH) sequencing centers are planning to sequence 100 complete human genomes in the next 2-3 years [1]. Furthermore, the increased throughput of the new sequencing machines makes it possible for biologists to sequence large numbers of bacterial strains and isolates, leading some microbiologists to suggest that we characterize the genomes of all organisms present in culture collections.

These technologies greatly increase sequencing throughput by laying out millions of DNA fragments on a single chip and sequencing all these fragments in parallel. The various technologies differ in the procedures used to array the DNA fragments: 454 and Applied Biosystems first attach the DNA to coated beads, whereas Solexa and Helicos attach the DNA directly to the chip. (For a more detailed description of these technologies, see the companion article by Mardis [2].)
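To give a rough sense of the scale involved, the following back-of-the-envelope sketch converts the throughput and read-length figures quoted above (up to one billion bases per day, reads of 35-40 nucleotides) into reads per day and instrument-days per genome. The human genome size of about 3.1 Gb and the 30-fold coverage target are illustrative assumptions rather than figures from this review, and the function names are hypothetical.

```python
# Back-of-the-envelope throughput arithmetic (illustrative sketch only).

BASES_PER_DAY = 1_000_000_000      # "up to one billion bases in a single day"
READ_LENGTH = 35                   # shortest read length mentioned in the text
HUMAN_GENOME_SIZE = 3_100_000_000  # assumption: approximate haploid human genome
TARGET_COVERAGE = 30               # assumption: illustrative fold coverage


def reads_per_day(bases_per_day: int, read_length: int) -> int:
    """Number of whole reads produced per day at a given read length."""
    return bases_per_day // read_length


def instrument_days(genome_size: int, coverage: float, bases_per_day: int) -> float:
    """Days of sequencing needed to reach the requested fold coverage."""
    return genome_size * coverage / bases_per_day


if __name__ == "__main__":
    n_reads = reads_per_day(BASES_PER_DAY, READ_LENGTH)
    days = instrument_days(HUMAN_GENOME_SIZE, TARGET_COVERAGE, BASES_PER_DAY)
    print(f"Reads per day at {READ_LENGTH} nt: {n_reads:,}")
    print(f"Instrument-days for {TARGET_COVERAGE}x human coverage: {days:.1f}")
    print(f"Instrument-days for 100 such genomes: {100 * days:.0f}")
```

Under these assumptions a single instrument yields tens of millions of reads per day, which is the volume that downstream assembly software and data archives must absorb.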