We introduce Salmon, a method for quantifying transcript abundance from RNA-seq reads that is accurate and fast. Salmon is the first transcriptome-wide quantifier to correct for fragment GC content bias, which we demonstrate substantially improves the accuracy of abundance estimates and the reliability of subsequent differential expression analysis. Salmon combines a new dual-phase parallel inference algorithm and feature-rich bias models with an ultra-fast read mapping procedure.
We introduce Sailfish, a computational method for quantifying the abundance of previously annotated RNA isoforms from RNA-seq data. Because Sailfish entirely avoids mapping reads, a time-consuming step in all current methods, it provides quantification estimates much faster than do existing approaches (typically 20 times faster) without loss of accuracy. By facilitating frequent reanalysis of data and reducing the need to optimize parameters, Sailfish exemplifies the potential of lightweight algorithms for efficiently processing sequencing reads.
TransRate is a tool for reference-free quality assessment of de novo transcriptome assemblies. Using only the sequenced reads and the assembly as input, we show that multiple common artifacts of de novo transcriptome assembly can be readily detected. These include chimeras, structural errors, incomplete assembly, and base errors. TransRate evaluates these errors to produce a diagnostic quality score for each contig, and these contig scores are integrated to evaluate whole assemblies. Thus, TransRate can be used for de novo assembly filtering and optimization as well as comparison of assemblies generated using different methods from the same input reads. Applying the method to a data set of 155 published de novo transcriptome assemblies, we deconstruct the contribution that assembly method, read length, read quantity, and read quality make to the accuracy of de novo transcriptome assemblies and reveal that variance in the quality of the input data explains 43% of the variance in the quality of published de novo transcriptome assemblies. Because TransRate is reference-free, it is suitable for assessment of assemblies of all types of RNA, including assemblies of long noncoding RNA, rRNA, mRNA, and mixed RNA samples.
Chromosome conformation capture experiments have led to the discovery of dense, contiguous, megabase-sized topological domains that are similar across cell types and conserved across species. These domains are strongly correlated with a number of chromatin markers and have since been included in a number of analyses. However, functionally-relevant domains may exist at multiple length scales. We introduce a new and efficient algorithm that is able to capture persistent domains across various resolutions by adjusting a single scale parameter. The ensemble of domains we identify allows us to quantify the degree to which the domain structure is hierarchical as opposed to overlapping, and our analysis reveals a pronounced hierarchical structure in which larger stable domains tend to completely contain smaller domains. The identified novel domains are substantially different from domains reported previously and are highly enriched for insulating factor CTCF binding and histone marks at the boundaries.
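As a rough illustration of what "hierarchical as opposed to overlapping" can mean for a set of domains, the sketch below classifies every pair of intersecting domains as either nested (one completely contains the other) or partially overlapping, and reports the nested fraction. This is only a toy measure under assumed inputs, not the statistic used by the authors; the function names `classify_pair` and `nesting_fraction` and the half-open `(start, end)` interval representation are hypothetical.

```python
from itertools import combinations


def classify_pair(a, b):
    """Classify two half-open intervals (start, end) as disjoint, nested, or overlapping."""
    (s1, e1), (s2, e2) = a, b
    if e1 <= s2 or e2 <= s1:
        return "disjoint"
    # One interval completely contains the other.
    if (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2):
        return "nested"
    return "overlapping"


def nesting_fraction(domains):
    """Fraction of intersecting domain pairs that are nested (None if no pair intersects)."""
    kinds = [classify_pair(a, b) for a, b in combinations(domains, 2)]
    intersecting = [k for k in kinds if k != "disjoint"]
    if not intersecting:
        return None
    return intersecting.count("nested") / len(intersecting)


# Hypothetical domains pooled across several resolutions (coordinates in bins).
domains = [(0, 100), (10, 40), (50, 90), (80, 120)]
fraction = nesting_fraction(domains)
```

A value near 1 would suggest a mostly hierarchical structure, while a value near 0 would suggest that domains mostly overlap without containment.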
Existing methods for quantifying transcript abundance require a fundamental compromise: either use high quality read alignments and experiment-specific models or sacrifice them for speed. We introduce Salmon, a quantification method that overcomes this restriction by combining a novel 'lightweight' alignment procedure with a streaming parallel inference algorithm and a feature-rich bias model. These innovations yield both exceptional accuracy and order-of-magnitude speed benefits over traditional alignment-based methods. Estimating transcript abundance across cell types, species, and conditions is a fundamental task in genomics. For example, these estimates are used for the classification of diseases and their subtypes [1], for understanding expression changes during development [2], and tracking the progression of cancer [3]. Efficient quantification of transcript abundance from RNA-seq data is an especially pressing problem due to the exponentially increasing number of experiments and the growing adoption of expression data for medical diagnosis [4]. However, various methods that address this problem achieve accurate results at the cost of requiring significant computational resources and do not scale well with the rate at which data is produced [5]. The recently developed quantification tool Sailfish [6] achieves an order of magnitude speed improvement over previous approaches, but Sailfish can sometimes produce slightly less accurate estimates for paired-end data or for stranded protocols and does not take advantage of high quality alignment information and experiment-specific models. We introduce a quantification procedure, called Salmon (Supplementary Fig. 1), that achieves best-in-class accuracy, takes advantage of high quality alignment information and experiment-specific models and provides the same order-of-magnitude speed benefits as Sailfish. Using synthetic data from both the RSEM simulator [7] and the Flux Simulator [8] as well as experimental quantitative PCR data [9], we show that Salmon generally outperforms Sailfish and eXpress [10] with respect to accuracy (Fig. 1a-b,e; Supplementary Tables 1 and 2) and is also faster than Sailfish (Fig. 1c). The transcript abundance estimation problem is particularly difficult for genes with many isoforms since reads derived from these genes can map to many more transcripts, and we find that Salmon is also generally more accurate in this case (Fig. 1d). Salmon is designed to run in parallel so that the procedure scales better with the number of reads in an experiment. Salmon can quantify abundance either via a lightweight alignment procedure (Online methods, Lightweight alignment and Supplementary Fig. 2), or using pre-computed alignments provided in SAM or BAM format; we find that the quantification accuracy is robust to this choice of input (Supplementary Fig. 3). Salmon is also typically more accurate than a recent unpublished procedure Kallisto (Supplementary Figs. 4 and 5, Supplementary Table 1). An innovation contributing to Salmon's speed and accuracy is ...
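To make the difficulty with multi-isoform genes concrete, the following is a minimal sketch of a textbook expectation-maximization (EM) scheme for splitting ambiguously mapped fragments among transcripts. It is not Salmon's inference algorithm, which the text describes as a streaming parallel procedure with bias models. The function name `em_abundance`, the input layout, and the toy data are assumptions made purely for illustration.

```python
from collections import defaultdict


def em_abundance(compat, lengths, n_iter=200):
    """Estimate the fraction of fragments originating from each transcript.

    compat  -- list of tuples; each tuple holds the transcript ids one fragment maps to
    lengths -- dict mapping transcript id to its (effective) length
    """
    transcripts = sorted(lengths)
    # Start from a uniform guess.
    alpha = {t: 1.0 / len(transcripts) for t in transcripts}
    n = len(compat)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for frag in compat:
            # E-step: split the fragment among its compatible transcripts,
            # weighting each by current abundance over transcript length.
            weights = {t: alpha[t] / lengths[t] for t in frag}
            total = sum(weights.values())
            if total == 0.0:
                continue  # fragment has no support under the current estimate
            for t, w in weights.items():
                counts[t] += w / total
        # M-step: abundances become the normalized expected counts.
        alpha = {t: counts[t] / n for t in transcripts}
    return alpha


# Toy data: two isoforms of equal length; 60 of 100 fragments map to both.
compat = [("A",)] * 30 + [("A", "B")] * 60 + [("B",)] * 10
estimates = em_abundance(compat, {"A": 1000, "B": 1000})
```

In this toy case the shared fragments are apportioned according to the evidence from uniquely mapping fragments, so isoform A ends up with about three quarters of the total. The more isoforms a gene has, the larger the share of fragments that must be apportioned this way.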
We introduce alevin, a fast end-to-end pipeline to process droplet-based single-cell RNA sequencing data, performing cell barcode detection, read mapping, unique molecular identifier (UMI) deduplication, gene count estimation, and cell barcode whitelisting. Alevin's approach to UMI deduplication considers transcript-level constraints on the molecules from which UMIs may have arisen and accounts for both gene-unique reads and reads that multimap between genes. This addresses the inherent bias in existing tools, which discard gene-ambiguous reads, and improves the accuracy of gene abundance estimates. Alevin is considerably faster than existing gene quantification approaches, typically by a factor of eight, while also using less memory. Electronic supplementary material: The online version of this article (10.1186/s13059-019-1670-y) contains supplementary material, which is available to authorized users.