Although Kraken's k-mer-based approach provides fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed five-fold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.

Assigning taxonomic labels to sequencing reads is an important part of many computational genomics pipelines for metagenomics projects. Recent years have seen several approaches to accomplish this task in a time-efficient manner [1-3]. Kraken [4] used a memory-intensive algorithm that associates short genomic substrings (k-mers) with lowest common ancestor (LCA) taxa. Kraken and related tools like KrakenUniq [5] have proven highly efficient and accurate in other tool comparisons [6,7]. But Kraken's high memory requirements force many researchers to either use a reduced-sensitivity MiniKraken database [8,9], or to build and use many indexes over subsets of the reference sequences [10,11]. Its memory requirements can easily exceed 100 GB [7], especially when the reference data includes large eukaryotic genomes [12,13]. Here we introduce Kraken 2, which provides a major reduction in memory usage as well as faster classification, a spaced-seed searching scheme, a translated search mode for matching in amino acid space, and continued compatibility with the Bracken [14] species-level quantification algorithm.

Kraken 2 addresses the issue of large memory requirements through two changes to Kraken 1's data structures and algorithms. While Kraken 1 used a sorted list of k-mer/LCA pairs indexed by minimizers [15], Kraken 2 introduces a probabilistic, compact hash table to map minimizers to LCAs. This table uses one-third of the memory of a standard hash table, at the cost of some specificity and accuracy.
Additionally, Kraken 2 only stores minimizers (of length ℓ, ℓ ≤ k) from the reference sequence library in the data structure, whereas Kraken 1 stored all k-mers. Kraken 2's index for a reference database consisting of 9.1 Gbp of genomic sequence uses 10.6 gigabytes of memory at classification time. Kraken 1's index for the same reference set uses 72.4 gigabytes of memory for classification (Figure 1a, Supplementary Table S1). In general, a Kraken 2 database is about 15% as large as a Kraken 1 database over the same references (Supplementary Figure S1).

Kraken 2's approach is faster than Kraken 1's because only the distinct minimizers from the query (read) trigger accesses to the hash table. A similar minimizer-based approach has proven useful in accelerating read alignment [16]. Kraken 2 additionally provides a hash-based filtering approach that subsamples the set of minimizer/LCA pairs included in the table, allowing the user to specify a target hash table size; smaller hash tables yield lower memory footprint and higher classification throughput at the expens...
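The minimizer idea in the Kraken 2 entry above can be illustrated with a small sketch. It is not Kraken 2's implementation: it assumes plain lexicographic ordering of ℓ-mers, ignores reverse complements and spaced seeds, and uses an ordinary Python dict in place of the compact probabilistic hash table. The function names and the toy taxon ID are hypothetical.

```python
def minimizers(seq, k, l):
    """Yield, for each k-mer in seq, its lexicographically smallest l-mer.

    Assumption: lexicographic order stands in for whatever ordering
    a real tool applies to l-mers.
    """
    if not (0 < l <= k):
        raise ValueError("require 0 < l <= k")
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        yield min(kmer[j:j + l] for j in range(k - l + 1))


def distinct_minimizers(seq, k, l):
    """Return the minimizers of seq without repeats, in first-seen order."""
    return list(dict.fromkeys(minimizers(seq, k, l)))


def lookup_hits(seq, table, k, l):
    """Look up each distinct minimizer once in a minimizer -> taxon ID table.

    `table` is a plain dict here (an assumption for illustration only).
    """
    return [table[m] for m in distinct_minimizers(seq, k, l) if m in table]


read = "ACGTACGTGA"
per_kmer = list(minimizers(read, k=5, l=3))
unique = distinct_minimizers(read, k=5, l=3)
print(len(per_kmer), "k-mers share", len(unique), "distinct minimizers:", unique)
print(lookup_hits(read, {"ACG": 1}, k=5, l=3))  # toy table with a made-up taxon ID
```

In this toy read, six overlapping k-mers collapse to two distinct minimizers, so only two table lookups are needed, which is the effect the entry credits for the speedup.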
As sequencing throughput approaches dozens of gigabases per day, there is a growing need for efficient software for analysis of transcriptome sequencing (RNA-Seq) data. Myrna is a cloud-computing pipeline for calculating differential gene expression in large RNA-Seq datasets. We apply Myrna to the analysis of publicly available data sets and assess the goodness of fit of standard statistical models. Myrna is available from http://bowtie-bio.sf.net/myrna.
We present recount3, a resource consisting of over 750,000 publicly available human and mouse RNA sequencing (RNA-seq) samples uniformly processed by our new Monorail analysis pipeline. To facilitate access to the data, we provide the recount3 and snapcount R/Bioconductor packages as well as complementary web resources. Using these tools, data can be downloaded as study-level summaries or queried for specific exon-exon junctions, genes, samples, or other features. Monorail can be used to process local and/or private data, allowing results to be directly compared to any study in recount3. Taken together, our tools help biologists maximize the utility of publicly available RNA-seq data, especially to improve their understanding of newly collected data. recount3 is available from http://rna.recount.bio.
We characterize the landscape of somatic mutations—mutations occurring after fertilization—in the human brain using ultra-deep (~250X) whole-genome sequencing of prefrontal cortex from 59 autism spectrum disorder (ASD) cases and 15 controls. We observe a mean of 26 somatic single nucleotide variants (sSNVs) per brain present in ≥4% of cells, with enrichment of mutations in coding and putative regulatory regions. Our analysis reveals that the first cell division after fertilization produces ~3.4 mutations, followed by 2–3 mutations in subsequent generations. This suggests that a typical individual possesses ~80 sSNVs present in ≥2% of cells—comparable to the number of de novo germline mutations per generation—with about half of individuals having at least one potentially function-altering somatic mutation somewhere in the cortex. ASD brains show an excess of somatic mutations in neural enhancer sequences compared to controls, suggesting that mosaic enhancer mutations may contribute to ASD risk.
General-purpose processors can now contain many dozens of processor cores and support hundreds of simultaneous threads of execution. To make best use of these threads, genomics software must contend with new and subtle computer architecture issues. We discuss some of these and propose methods for improving thread scaling in tools that analyze each read independently, such as read aligners. We implement these methods in new versions of Bowtie, Bowtie 2 and HISAT. We greatly improve thread scaling in many scenarios, including on the recent Intel Xeon Phi architecture. We also highlight how bottlenecks are exacerbated by variable-record-length file formats like FASTQ and suggest changes that enable superior scaling.
Public archives of next-generation sequencing data are growing exponentially, but the difficulty of marshaling this data has led to its underutilization by scientists. Here we present ASCOT, a resource that allows researchers to summarize, visualize, and query alternative splicing patterns in public RNA-Seq data. ASCOT enables rapid identification of splice-variants across tens of thousands of bulk and single-cell RNA-Seq datasets in human and mouse. To demonstrate the utility of ASCOT, we first focused on the nervous system and identified many alternative exons used only by a single neuronal subtype. We then leveraged datasets from the ENCODE and GTEx consortiums to study the unique splicing patterns of rod photoreceptors and found that PTBP1 knockdown combined with overexpression of MSI1 and PCBP2 activates rod-specific exons in HepG2 liver cancer cells. Furthermore, we observed that MSI1 targets intronic UAG motifs proximal to the 5' splice site and interacts synergistically with PTBP1 downregulation. Finally, we show that knockdown of MSI1 in the retina abolishes rod-specific splicing. This work demonstrates how large-scale analysis of public RNA-Seq datasets can yield key insights into cell type-specific control of RNA splicing and underscores the importance of considering both annotated and unannotated splicing events. ASCOT splicing and gene expression data tables, software, and interactive browser are available at http://ascot.cs.jhu.edu.