Identifying molecular cancer drivers is critical for precision oncology. Multiple advanced algorithms to identify drivers now exist, but systematic attempts to combine and optimize them on large datasets are few. We report a PanCancer and PanSoftware analysis spanning 9,423 tumor exomes (comprising all 33 of The Cancer Genome Atlas projects) and using 26 computational tools to catalog driver genes and mutations. We identify 299 driver genes with implications regarding their anatomical sites and cancer/cell types. Sequence- and structure-based analyses identified >3,400 putative missense driver mutations supported by multiple lines of evidence. Experimental validation confirmed 60%-85% of predicted mutations as likely drivers. We found that >300 MSI tumors are associated with high PD-1/PD-L1, and 57% of tumors analyzed harbor putative clinically actionable events. Our study represents the most comprehensive discovery of cancer genes and mutations to date and will serve as a blueprint for future biological and clinical endeavors.
The assembly of long reads from Pacific Biosciences and Oxford Nanopore Technologies typically requires resource-intensive error-correction and consensus-generation steps to obtain high-quality assemblies. We show that the error-correction step can be omitted and that high-quality consensus sequences can be generated efficiently with a SIMD-accelerated, partial-order alignment-based, stand-alone consensus module called Racon. Based on tests with PacBio and Oxford Nanopore data sets, we show that Racon coupled with miniasm enables consensus genomes with similar or better quality than state-of-the-art methods while being an order of magnitude faster.
Numerous studies are currently underway to characterize the microbial communities inhabiting our world. These studies aim to dramatically expand our understanding of the microbial biosphere and, more importantly, hope to reveal the secrets of the complex symbiotic relationship between us and our commensal bacterial microflora. An important prerequisite for such discoveries are computational tools that are able to rapidly and accurately compare large datasets generated from complex bacterial communities to identify features that distinguish them.We present a statistical method for comparing clinical metagenomic samples from two treatment populations on the basis of count data (e.g. as obtained through sequencing) to detect differentially abundant features. Our method, Metastats, employs the false discovery rate to improve specificity in high-complexity environments, and separately handles sparsely-sampled features using Fisher's exact test. Under a variety of simulations, we show that Metastats performs well compared to previously used methods, and significantly outperforms other methods for features with sparse counts. We demonstrate the utility of our method on several datasets including a 16S rRNA survey of obese and lean human gut microbiomes, COG functional profiles of infant and mature gut microbiomes, and bacterial and viral metabolic subsystem data inferred from random sequencing of 85 metagenomes. The application of our method to the obesity dataset reveals differences between obese and lean subjects not reported in the original study. For the COG and subsystem datasets, we provide the first statistically rigorous assessment of the differences between these populations. The methods described in this paper are the first to address clinical metagenomic datasets comprising samples from multiple subjects. Our methods are robust across datasets of varied complexity and sampling level. While designed for metagenomic applications, our software can also be applied to digital gene expression studies (e.g. SAGE). A web server implementation of our methods and freely available source code can be found at http://metastats.cbcb.umd.edu/.
The study of cell-population heterogeneity in a range of biological systems, from viruses to bacterial isolates to tumor samples, has been transformed by recent advances in sequencing throughput. While the high-coverage afforded can be used, in principle, to identify very rare variants in a population, existing ad hoc approaches frequently fail to distinguish true variants from sequencing errors. We report a method (LoFreq) that models sequencing run-specific error rates to accurately call variants occurring in <0.05% of a population. Using simulated and real datasets (viral, bacterial and human), we show that LoFreq has near-perfect specificity, with significantly improved sensitivity compared with existing methods and can efficiently analyze deep Illumina sequencing datasets without resorting to approximations or heuristics. We also present experimental validation for LoFreq on two different platforms (Fluidigm and Sequenom) and its application to call rare somatic variants from exome sequencing datasets for gastric cancer. Source code and executables for LoFreq are freely available at http://sourceforge.net/projects/lofreq/.
In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below the family level. Parameter settings substantially impacted performances, underscoring the importance of program reproducibility. While highlighting current challenges in computational metagenomics, the CAMI results provide a roadmap for software selection to answer specific research questions.
Citrus is a large genus that includes several major cultivated species, including C. sinensis (sweet orange), Citrus reticulata (tangerine and mandarin), Citrus limon (lemon), Citrus grandis (pummelo) and Citrus paradisi (grapefruit). In 2009, the global citrus acreage was 9 million hectares and citrus production was 122.3 million tons (FAO statistics, see URLs), which is the top ranked among all the fruit crops. Among the 10.9 million tons (valued at $9.3 billion) of citrus products traded in 2009, sweet orange accounted for approximately 60% of citrus production for both fresh fruit and processed juice consumption (FAO statistics, see URLs). Moreover, citrus fruits and juice are the prime human source of vitamin C, an important component of human nutrition.Citrus fruits also have some unique botanical features, such as nucellar embryony (nucellus cells can develop into apomictic embryos that are genetically identical to mother plant). Consequently, somatic embryos grow much more vigorously than the zygotic embryos in seeds such that seedlings are essentially clones of the maternal parent. Such citrus-unique characteristics have hindered the study of citrus genetics and breeding improvement 1,2 . Complete genome sequences would provide valuable genetic resources for improving citrus crops.Citrus is believed to be native to southeast Asia 3-5 , and cultivation of fruit crops occurred at least 4,000 years ago 3,6 . The genetic origin of the sweet orange is not clear, although there are some speculations that sweet orange might be derived from interspecific hybridization of some primitive citrus species 7,8 . Citrus is also in the order Sapindales, a sister order to the Brassicales in the Malvidae, making it valuable for comparative genomics studies with the model plant Arabidopsis.We aimed to sequence the genome of Valencia sweet orange (C. sinensis cv. Valencia), one of the most important sweet orange varieties cultivated worldwide and grown primarily for orange juice production. Normal sweet oranges are diploids, with nine pairs of chromosomes and an estimated genome size of ~367 Mb 9 . To reduce the complexity of the sequenced genome, we obtained a doublehaploid (dihaploid) line derived from the anther culture of Valencia sweet orange 10 . We first generated whole-genome shotgun pairedend-tag sequence reads from the dihaploid genomic DNA and built a de novo assembly as the citrus reference genome; we then produced shotgun sequencing reads from the parental diploid DNA and mapped the sequences to the haploid reference genome to obtain the complete genome information for Valencia sweet orange. In addition, we conducted comprehensive transcriptome sequencing analyses for four representative tissues using shotgun RNA sequencing (RNA-Seq) to capture all transcribed sequences and paired-end-tag RNA sequencing (RNA-PET) to demarcate the 5′ and 3′ ends of all transcripts. On the basis of the DNA and RNA sequencing data, we characterized the orange genome for its gene content, heterozygosity and evolutionary features. ...
We describe genome mapping on nanochannel arrays. In this approach, specific sequence motifs in single DNA molecules are fluorescently labeled, and the DNA molecules are uniformly stretched in thousands of silicon channels on a nanofluidic device. Fluorescence imaging allows the construction of maps of the physical distances between occurrences of the sequence motifs. We demonstrate the analysis, individually and as mixtures, of 95 bacterial artificial chromosome (BAC) clones that cover the 4.7-Mb human major histocompatibility complex region. We obtain accurate, haplotype-resolved, sequence motif maps hundreds of kilobases in length, resulting in a median coverage of 114× for the BACs. The final sequence motif map assembly contains three contigs. With an average distance of 9 kb between labels, we detect 22 haplotype differences. We also use the sequence motif maps to provide scaffolds for de novo assembly of sequencing data. Nanochannel genome mapping should facilitate de novo assembly of sequencing reads from complex regions in diploid organisms, haplotype and structural variation analysis and comparative genomics.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
334 Leonard St
Brooklyn, NY 11211
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.