Third generation single molecule sequencing technology is poised to revolutionize genomics by enabling the sequencing of long, individual molecules of DNA and RNA. These technologies now routinely produce reads exceeding 5,000 basepairs, and can achieve reads as long as 50,000 basepairs. Here we evaluate the limits of single molecule sequencing by assessing the impact of long read sequencing in the assembly of the human genome and 25 other important genomes across the tree of life. From this, we develop a new data-driven model using support vector regression that can accurately predict assembly performance. We also present a novel hybrid error correction algorithm for long PacBio sequencing reads that uses pre-assembled Illumina sequences for the error correction. We apply it to several prokaryotic and eukaryotic genomes, and show it can achieve near-perfect assemblies of small genomes (< 100 Mbp) and substantially improved assemblies of larger ones. All source code and the assembly model are available open-source.
Summary: Long read sequencing platforms, which include the widely used Pacific Biosciences (PacBio) platform and the emerging Oxford Nanopore platform, aim to produce sequence fragments in excess of 15-20 kilobases, and have proved advantageous in the identification of structural variants and easing genome assembly. However, long read sequencing remains relatively expensive and error prone, and failed sequencing runs represent a significant problem for genomics core facilities. To quantitatively assess the underlying mechanics of sequencing failure, it is essential to have highly reproducible and controllable reference data sets to which sequencing results can be compared. Here, we present SiLiCO, the first in silico simulation tool to generate standardized sequencing results from both of the leading long read sequencing platforms. Availability: SiLiCO is an open source package written in Python. It is freely available at
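As a rough illustration of what a reproducible in silico read simulation can look like, the following minimal sketch draws read lengths from a log-normal distribution and places each read uniformly on a reference of known length. It is not SiLiCO's implementation: the function name `simulate_reads`, the choice of a log-normal length distribution, and the parameters `mean_length`, `sigma`, and `seed` are all assumptions made for this example.

```python
import math
import random


def simulate_reads(genome_length, n_reads, mean_length, sigma=0.5, seed=0):
    """Return (start, end) coordinates of simulated reads.

    Hypothetical sketch: lengths are log-normal (an assumption), and
    start positions are uniform over the reference.
    """
    rng = random.Random(seed)  # fixed seed makes the data set reproducible
    # Pick mu so the log-normal mean equals mean_length:
    # mean = exp(mu + sigma^2 / 2)
    mu = math.log(mean_length) - sigma ** 2 / 2
    reads = []
    for _ in range(n_reads):
        length = round(rng.lognormvariate(mu, sigma))
        # Clamp so every read has at least 1 base and fits on the reference.
        length = min(genome_length, max(1, length))
        start = rng.randrange(0, genome_length - length + 1)
        reads.append((start, start + length))
    return reads


reads = simulate_reads(genome_length=1_000_000, n_reads=100, mean_length=10_000)
```

Because the random number generator is seeded, repeated calls with the same arguments return identical coordinates, which is the property a standardized reference data set needs.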
Tracking data flow in high throughput sequencing is important in maintaining a consistent number of successfully sequenced samples, making decisions on scheduling the flow of sequencing steps, resolving problems at various steps and tracking the status of different projects. This is especially critical when the laboratory is handling a multitude of projects. We have built a Web-based data flow tracking package, called Kaleidaseq, which allows us to monitor the flow and quality of sequencing samples through the steps of preparation of library plates, plaque-picking, preparation of templates, conducting sequencing reactions, loading of samples on gels, base-calling the traces, and calculating the quality of the sequenced samples. Kaleidaseq's suite of displays allows for outstanding monitoring of the production sequencing process. The online display of current information that Kaleidaseq provides on both project status and process queues sorted by project enables accurate real-time assessment of the necessary samples that must be processed to complete the project. This information allows the process manager to allocate future resources optimally and schedule tasks according to scientific priorities. Quality of the sequenced samples can be tracked on a daily basis, which allows the sequencing laboratory to maintain a steady performance level and quickly resolve dips in quality. Kaleidaseq has a simple easy-to-use interface that allows access to all major functions and process queues from one Web page. This software package is modular and designed to allow additional processing steps and new monitoring variables to be added and tracked with ease. Access to the underlying relational database is through the Perl DBI interface, which allows for the use of different relational databases. 
Kaleidaseq is available for free use by the academic community from http://www.cshl.org/kaleidaseq.
With the scale-up of sequencing efforts to unprecedented levels in genome centers around the world (Boguski et al. 1996; Marshall and Pennisi 1996), the need for scalable information systems to keep track of the flow of samples through the production pipeline of sequencing becomes more and more critical. Our laboratory is a member of a consortium whose mission is to sequence >6 Mb of chromosomes IV and V of Arabidopsis thaliana. Currently, our laboratory is processing 1000-2000 samples per week. This may increase in the future. Typically in the production pipeline of a large-scale sequencing facility, the samples go through the steps of making subclone libraries, making agar plates of bacteria infected with insert-carrying phage, picking of plaques, preparing template DNA for sequencing, conducting sequencing reactions, gel loading, calling of bases, and monitoring the quality of the sequenced samples. Several large insert clones, such as bacterial artificial chromosomes (BACs), as well as libraries of sequence-tagged sites (STSs), expressed sequence tags (ESTs), and cDNAs, are sequenced concurrently. Typically, there are lag times of days b...
Motivation: Exome sequencing is a powerful technique for the identification of disease-causing genes. A number of Mendelian inherited disease genes have been identified through this method. However, it remains a challenge to leverage exome sequencing for the study of complex disorders, such as schizophrenia and bipolar disorder, due to the genetic and phenotypic heterogeneity of these disorders. Although not feasible for many studies, sequencing large sample sizes (>10,000) may improve statistical power to associate more variants, while the aggregation of distinct rare variants associated with a given disease can make the identification of causal genes statistically challenging. Therefore, new methods for rare variant association are imperative to identify causative genes of complex disorders. Results: Here we propose a method to predict causative rare variants using a popular probabilistic problem: the Birthday Model, which estimates the probability that multiple individuals in a group share the same birthday. We consider the probability and coincidence of samples sharing a variant akin to the chance of individuals sharing the same birthday. We investigated the parameter effects of our model, providing guidelines for its use and interpretation of the results. Using published data on autism spectrum disorder and hypertriglyceridemia, in addition to a current case-control study on bipolar disorder, we evaluated this probabilistic method to identify potential causative variants. Several genes in the top results of the case-control study were associated with autism spectrum and bipolar disorder. Given that the core probability based on the birthday model is very sensitive to low recurrence, the method successfully tests the association of rare variants, which generally do not provide enough signal in commonly used statistical tests.
Importantly, the simplicity of the model allows quick interpretation of genomic data, enabling users to select gene candidates for further biological validation of specific mutations and downstream functional or other studies. Availability: https://github.com/yberstein/Birthday-Algorithm
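The analogy can be made concrete with the classic birthday calculation. The sketch below is not the authors' algorithm; it only computes the textbook probability that at least two of `n_samples` individuals coincide when each is assigned uniformly at random to one of `n_sites` equally likely positions (days of the year in the original problem, candidate variant sites in the analogy). The function name and both parameter names are hypothetical, and the uniform, independent assignment is an assumption of this example.

```python
def prob_any_shared(n_samples: int, n_sites: int) -> float:
    """Probability that at least two samples land on the same site by chance.

    Classic birthday problem: 1 - prod_{i=0}^{n-1} (d - i) / d,
    assuming uniform and independent assignment (illustrative only).
    """
    if n_samples > n_sites:
        return 1.0  # pigeonhole: a coincidence is certain
    p_all_distinct = 1.0
    for i in range(n_samples):
        p_all_distinct *= (n_sites - i) / n_sites
    return 1.0 - p_all_distinct


# The familiar case: 23 people and 365 days gives a probability just over 0.5.
p_birthday = prob_any_shared(23, 365)
```

When the number of possible sites is large relative to the number of samples, this chance probability is small, so an observed recurrence of the same rare variant is harder to attribute to coincidence.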