Accurate identification of genetic variants from next-generation sequencing (NGS) data is essential for immediate large-scale genomic endeavors such as the 1000 Genomes Project, and is crucial for further genetic analysis based on the discoveries. The key challenge in single nucleotide polymorphism (SNP) discovery is to distinguish true individual variants (occurring at a low frequency) from sequencing errors (often occurring at frequencies orders of magnitude higher). Therefore, knowledge of the error probabilities of base calls is essential. We have developed Atlas-SNP2, a computational tool that detects and accounts for systematic sequencing errors caused by context-related variables in a logistic regression model learned from training data sets. Subsequently, it estimates the posterior error probability for each substitution through a Bayesian formula that integrates prior knowledge of the overall sequencing error probability and the estimated SNP rate with the results from the logistic regression model for the given substitutions. The estimated posterior SNP probability can be used to distinguish true SNPs from sequencing errors. Validation results show that Atlas-SNP2 achieves a false-positive rate below 10%, with a false-negative rate of ~5% or lower.

[Supplemental material is available online at http://www.genome.org. Atlas-SNP2 and its documentation are available for download at http://www.hgsc.bcm.tmc.edu/cascade-tech-software-ti.hgsc.]

In recent years, next-generation sequencing (NGS) technologies have propelled the rapid progress of genomics studies (Hillier et al. 2008; Srivatsan et al. 2008). Continuous improvements in NGS technologies are increasing the throughput while lowering costs, thus enabling ultra-large-scale sequencing efforts (Margulies et al. 2005; Shendure and Ji 2008).
For example, the 1000 Genomes Project is aimed at sequencing more than 1000 human genomes to characterize the pattern of genetic variants (common and rare) in unprecedented detail (http://www.1000genomes.org/page.php) (Kaiser 2008). To realize this objective, it is essential that NGS technologies detect genomic variations accurately, including single nucleotide polymorphisms (SNPs), structural variations caused by insertions or deletions (indels), copy number variations (CNVs), and inversions or other rearrangements. However, the short read length and relatively high error rates present challenges to variant discovery from raw NGS data. While the error model for Sanger sequencing was well characterized (Ewing and Green 1998), systematic errors in NGS are not yet well studied, making it difficult to distinguish true genetic variations from the sequencing errors.

Currently, there are several methods available for detecting SNPs from NGS data, including Pyrobayes, POLYBAYES (Marth et al. 1999), MAQ (Li et al. 2008), SOAP (Li et al. 2009), VarScan (Ley et al. 2008; Koboldt et al. 2009), and other largely heuristic approaches (Wheeler et al. 2008). Pyrobayes-POLYBAYES recalibrates base-calling of all nucleotide positions from ...
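
The abstract describes a two-stage idea: a logistic regression model scores each substitution, and a Bayesian step combines that score with prior rates of sequencing error and of true SNPs. The Python sketch below illustrates that general pattern only. It is not the Atlas-SNP2 formula, which this excerpt does not give. The function names, the feature weights, the training class fraction, and the prior values are all hypothetical, and the odds-reweighting step is a generic prior-adjustment technique assumed here for illustration.

```python
import math


def logistic(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def model_error_probability(features, weights, intercept):
    """Hypothetical logistic model: P(error | context features) as seen in training data."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


def posterior_snp_probability(p_error_model, train_error_fraction,
                              prior_error, prior_snp):
    """Combine a model score with prior rates (generic sketch, assumed method).

    p_error_model        -- model output, P(error | features) under the training class mix
    train_error_fraction -- fraction of training substitutions that were errors
    prior_error          -- assumed overall sequencing error probability per base
    prior_snp            -- assumed SNP rate per base
    """
    eps = 1e-12
    p = min(max(p_error_model, eps), 1.0 - eps)

    # Divide out the training class odds to obtain a likelihood ratio
    # (error versus SNP) for the observed features.
    model_odds = p / (1.0 - p)
    train_odds = train_error_fraction / (1.0 - train_error_fraction)
    likelihood_ratio = model_odds / train_odds

    # Re-weight by the prior odds of error versus SNP, then convert the
    # posterior odds of error into a posterior probability of SNP.
    posterior_odds_error = likelihood_ratio * (prior_error / prior_snp)
    return 1.0 / (1.0 + posterior_odds_error)


if __name__ == "__main__":
    # Illustrative values only; none of these come from the paper.
    p_err = model_error_probability(features=[1.0, 0.2],
                                    weights=[-2.0, 1.5],
                                    intercept=0.5)
    print(posterior_snp_probability(p_err, train_error_fraction=0.5,
                                    prior_error=0.01, prior_snp=0.001))
```

In this sketch a substitution would be called a SNP when the returned posterior exceeds a chosen cutoff; the cutoff, like the priors, is a tuning choice that the excerpt does not specify.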