Mohsen Zakeri scite author profile

Background The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. Results We investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. Conclusion We observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly, and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.

show abstract

Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data

Zakeri

Sarkar

et al. 2022

Nat Methods

View full text Add to dashboard Cite

The rapid growth of high-throughput single-cell and singlenucleus RNA sequencing technologies has produced a wealth of data over the past few years. The available technologies continue to evolve and experiments continue to increase in both number and scale. The size, volume, and distinctive characteristics of these data necessitate the development of new software and associated computational methods to accurately and efficiently quantify single-cell and single-nucleus RNA-seq data into count matrices that constitute the input to downstream analyses.We introduce the alevin-fry framework for quantifying single-cell and single-nucleus RNA-seq data. Despite being faster and more memory frugal than other accurate and scalable quantification approaches, alevin-fry does not suffer from the false positive expression or memory scalability issues that are exhibited by other lightweight tools. We demonstrate how alevin-fry can be effectively used to quantify single-cell and single-nucleus RNA-seq data, and also how the spliced and unspliced molecule quantification required as input for RNA velocity analyses can be seamlessly extracted from the same preprocessed data used to generate regular gene expression count matrices.

show abstract

Alignment and mapping methodology influence transcript abundance estimation

Srivastava

Malik

Sarkar

et al. 2019

Preprint

View full text Add to dashboard Cite

Background: The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy.Results: We investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large, and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally-acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment.Conclusion: We observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification. * Contributed equally.

show abstract

PuffAligner: a fast, efficient and accurate aligner based on the Pufferfish index

Almodaresi

Zakeri

Patro

2021

View full text Add to dashboard Cite

Motivation Sequence alignment is one of the first steps in many modern genomic analyses, such as variant detection, transcript abundance estimation and metagenomic profiling. Unfortunately, it is often a computationally expensive procedure. As the quantity of data and wealth of different assays and applications continue to grow, the need for accurate and fast alignment tools that scale to large collections of reference sequences persists. Results In this paper, we introduce PuffAligner, a fast, accurate and versatile aligner built on top of the Pufferfish index. PuffAligner is able to produce highly-sensitive alignments, similar to those of Bowtie2, but much more quickly. While exhibiting similar speed to the ultrafast STAR aligner, PuffAligner requires considerably less memory to construct its index and align reads. PuffAligner strikes a desirable balance with respect to the time, space, and accuracy tradeoffs made by different alignment tools, and provides a promising foundation on which to test new alignment ideas over large collections of sequences. Availability PuffAligner is a free and open-source software. It is implemented in C ++14 and can be obtained from https://github.com/COMBINE-lab/pufferfish/tree/cigar-strings Supplementary information Supplementary data are available at Bioinformatics online.

show abstract

Improved data-driven likelihood factorizations for transcript abundance estimation

Zakeri

Srivastava

Almodaresi

et al. 2017

View full text Add to dashboard Cite

MotivationMany methods for transcript-level abundance estimation reduce the computational burden associated with the iterative algorithms they use by adopting an approximate factorization of the likelihood function they optimize. This leads to considerably faster convergence of the optimization procedure, since each round of e.g. the EM algorithm, can execute much more quickly. However, these approximate factorizations of the likelihood function simplify calculations at the expense of discarding certain information that can be useful for accurate transcript abundance estimation.ResultsWe demonstrate that model simplifications (i.e. factorizations of the likelihood function) adopted by certain abundance estimation methods can lead to a diminished ability to accurately estimate the abundances of highly related transcripts. In particular, considering factorizations based on transcript-fragment compatibility alone can result in a loss of accuracy compared to the per-fragment, unsimplified model. However, we show that such shortcomings are not an inherent limitation of approximately factorizing the underlying likelihood function. By considering the appropriate conditional fragment probabilities, and adopting improved, data-driven factorizations of this likelihood, we demonstrate that such approaches can achieve accuracy nearly indistinguishable from methods that consider the complete (i.e. per-fragment) likelihood, while retaining the computational efficiently of the compatibility-based factorizations.Availability and implementationOur data-driven factorizations are incorporated into a branch of the Salmon transcript quantification tool: https://github.com/COMBINE-lab/salmon/tree/factorizations.Supplementary information Supplementary data are available at Bioinformatics online.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Mohsen Zakeri

Alignment and mapping methodology influence transcript abundance estimation

Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data

Alignment and mapping methodology influence transcript abundance estimation

PuffAligner: a fast, efficient and accurate aligner based on the Pufferfish index

Improved data-driven likelihood factorizations for transcript abundance estimation

Contact Info

Product

Resources

About