Obtaining accurate sequences from long DNA molecules is very important for genome assembly and other applications. Here we describe single tube long fragment read (stLFR), a technology that enables this at low cost. It is based on adding the same barcode sequence to sub-fragments of the original long DNA molecule (DNA co-barcoding). To achieve this efficiently, stLFR uses the surface of microbeads to create millions of miniaturized barcoding reactions in a single tube. Using a combinatorial process, up to 3.6 billion unique barcode sequences were generated on beads, enabling practically non-redundant co-barcoding with 50 million barcodes per sample. Using stLFR, we demonstrate efficient unique co-barcoding of over 8 million 20-300 kb genomic DNA fragments. Analysis of the human genome NA12878 with stLFR demonstrated high quality variant calling and phasing into contigs with an N50 of up to 34 Mb. We also demonstrate detection of complex structural variants and complete diploid de novo assembly of NA12878. These analyses were all performed using single stLFR libraries, and their construction did not significantly add to the time or cost of whole genome sequencing (WGS) library preparation. stLFR represents an easily automatable solution that enables high quality sequencing, phasing, SV detection, scaffolding, cost-effective diploid de novo genome assembly, and other long DNA sequencing applications.
Numerous technologies, including direct single molecule sequencing (Levene et al. recently been developed to generate at least some of this information. Most are based on the process of co-barcoding (Peters et al. 2014), that is, the addition of the same barcode to the sub-fragments of single long genomic DNA molecules. After sequencing, the barcode information can be used to determine which reads are derived from the same original long DNA molecule. This process was first described by Drmanac (Drmanac 2006) and implemented as a 384-well plate assay by Peters et al. (Peters et al. 2012). These approaches have been technically challenging to implement, are expensive, have lower data quality, do not analyze individual DNA molecules separately (i.e., do not provide unique co-barcoding), or some combination of these. In practice, most require a separate whole genome sequence to be generated by standard methods to improve variant calling. In addition, most can only provide haplotype information and are unable to provide the other information necessary for perfect genome sequencing.
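To make the co-barcoding idea concrete, the following minimal Python sketch groups mapped reads by barcode and estimates the genomic span covered by each group, which approximates the extent of the original long molecule. The `Read` record layout, the barcode strings, and both function names are hypothetical and chosen only for illustration; they do not reflect the actual stLFR data format or analysis pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Read(NamedTuple):
    """Hypothetical mapped-read record (not the real stLFR format)."""
    read_id: str
    barcode: str
    chrom: str
    pos: int


def group_reads_by_barcode(reads: Iterable[Read]) -> Dict[str, List[Read]]:
    """Collect reads that share a barcode, preserving input order."""
    groups: Dict[str, List[Read]] = defaultdict(list)
    for read in reads:
        groups[read.barcode].append(read)
    return dict(groups)


def fragment_span(reads: List[Read]) -> Tuple[str, int, int]:
    """Return (chrom, min_pos, max_pos) for reads assumed to share one molecule.

    Raises ValueError if the list is empty or spans several chromosomes,
    since this simple sketch does not try to split such groups.
    """
    if not reads:
        raise ValueError("no reads supplied")
    chroms = {r.chrom for r in reads}
    if len(chroms) != 1:
        raise ValueError("reads map to more than one chromosome")
    positions = [r.pos for r in reads]
    return chroms.pop(), min(positions), max(positions)


if __name__ == "__main__":
    example = [
        Read("r1", "BC1", "chr1", 100),
        Read("r2", "BC2", "chr1", 5000),
        Read("r3", "BC1", "chr1", 900),
    ]
    for barcode, group in group_reads_by_barcode(example).items():
        print(barcode, fragment_span(group))
```

In a real analysis, a barcode shared by two or more long molecules would need additional handling, for example splitting a group wherever neighbouring reads are separated by a large gap; this sketch deliberately omits that step.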
Results
stLFR library process
Here we describe implementation of stLFR technology (Drmanac 2013), an efficient approach for DNA co-barcoding with millions of barcodes enabled in a single tube. This is achieved by using the surface of a microbead as a replacement for a compartment (e.g., the well of a 384-well plate). Each bead carries many copies of a unique barcode sequence which is transferred to the sub-fragments of each long DNA molecule. These co-barcoded sub-fragments are then analyzed on common short read sequencing devices such as ...