A junction coverage compatibility score to quantify the reliability of transcript abundance estimates and annotation catalogs

Soneson, Charlotte; Love, Michael I.; Patro, Rob; Hussain, Shobbir; Malhotra, Dheeraj; Robinson, Mark D.

doi:10.26508/lsa.201800175

Cited by 18 publications

(17 citation statements)

References 41 publications

(64 reference statements)


“…During the preparation of this manuscript, a preprint from Soneson et al. reported a similar observation and proposed the creation of a new index to flag such problematic genes [38]. While the current manuscript strongly emphasizes the role of 3’ UTRs in the emergence of estimation biases, we could pinpoint at least one example where 5’ UTRs play a similar role in the issue.…”

Section: Discussion (mentioning)

confidence: 62%

“…The use of the JCC (Junction Coverage Compatibility) score introduced by Soneson et al. will be greatly useful to prevent misinterpretation of transcriptomics studies in the future but will tie quantifications to the results of computationally demanding alignment methods [38]. Improvement of current genomic annotations might ultimately offer an alternative as they will allow for the sole use of fast quantification algorithms.…”

Section: Discussion (mentioning)

confidence: 99%

“…Improvement of current genomic annotations might ultimately offer an alternative as they will allow for the sole use of fast quantification algorithms. This might partially be achieved using transcript catalogues obtained from large scale studies such as CHESS [39] even though Soneson et al. [38] reported very little to no improvement in their JCC scores using these new annotations.…”

Section: Discussion (mentioning)

confidence: 99%


Correction of gene model annotations improves isoform abundance estimates: the example of ketohexokinase (Khk)

et al. 2019


Next generation sequencing protocols such as RNA-seq have made the genome-wide characterization of the transcriptome a crucial part of many research projects in biology. Analyses of the resulting data provide key information on gene expression and in certain cases on exon or isoform usage. The emergence of transcript quantification software such as Salmon has enabled researchers to efficiently estimate isoform and gene expression across the genome while tremendously reducing the necessary computational power. Although overall gene expression estimations were shown to be accurate, isoform expression quantification appears to be a more challenging task. Low expression levels and uneven or insufficient coverage were reported as potential explanations for inconsistent estimates. Here, through the example of the ketohexokinase (Khk) gene in mouse, we demonstrate that the use of an incorrect gene annotation can also result in erroneous isoform quantification results. Manual correction of the input Khk gene model provided a much more accurate estimation of relative Khk isoform expression when compared to quantitative PCR (qPCR) measurements. In particular, removal of an unexpressed retained intron and a proper adjustment of the 5’ and 3’ untranslated regions both had a strong impact on the correction of erroneous estimates. Finally, we observed a better concordance in isoform quantification between datasets and sequencing strategies when relying on the newly generated Khk annotations. These results highlight the importance of accurate gene models and annotations for correct isoform quantification and reassert the need for orthogonal methods of estimation of isoform expression to confirm important findings.


“…With the wide usage of gene or transcript expression data, we have seen improvement in probabilistic models of RNA-seq expression quantification [1][2][3][4], as well as characterization and evaluation of the errors of quantified expression [5][6][7]. However, there is an under-characterized type of estimation error which is due to the non-uniqueness of solutions to the probabilistic model.…”

Section: Introduction (mentioning)

confidence: 99%

Deriving Ranges of Optimal Estimated Transcript Expression Due to Non-identifiability

Zheng, Kingsford 2019 (preprint)

Current expression quantification methods suffer from a fundamental but under-characterized type of error: the most likely estimates for transcript abundances are not unique. Current quantification methods rely on probabilistic models, and the scenario where it admits multiple optimal solutions is called nonidentifiability. This means multiple estimates of transcript abundances generate the observed RNA-seq reads equally likely, and the underlying true expression cannot be determined. The non-identifiability problem is further exacerbated when incompleteness of reference transcriptome and existence of unannotated transcripts are taken into consideration. State-of-the-art quantification methods usually output a single inferred set of abundances, and the accuracy of the single expression set is unknown compared to other equally optimal solutions. Obtaining the set of equally optimal solutions is necessary for evaluating and extending downstream analyses to take non-identifiability into account. We propose methods to compute the range of equally optimal estimates for the expression of each transcript, accounting for non-identifiability of the quantification model using several novel graph theoretical approaches. It works under two scenarios, one assuming the reference transcriptome is complete, another assuming incomplete reference and allowing for expression of unannotated transcripts. Our methods calculate a "confidence" range for each transcript, representing its possible abundance across equally optimal estimates. This range can be used for evaluating the reliability of detected differentially expressed (DE) transcripts, as a large overlap of confidence range between DE conditions indicates the prediction may be unreliable due to uncertainty. We observe that 5 out of 257 DE predictions are unreliable on an MCF10 cell line and 19 out of 3152 are unreliable on a CD8 T cell dataset. The source code can be found at https://github.com/Kingsford-Group/subgraphquant.
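The reliability check described in this abstract, where a large overlap between a transcript's abundance ranges in two conditions suggests that a differential-expression call may be unreliable, can be illustrated with a small sketch. The function name and the choice to normalize the overlap by the shorter range are assumptions made here for illustration; the abstract does not specify how overlap is measured.

```python
def range_overlap_fraction(range_a, range_b):
    """Fraction of the shorter range that is covered by the other range.

    Each range is a (low, high) tuple of equally optimal abundance
    estimates for one transcript in one condition. Normalizing by the
    shorter range is an illustrative assumption, not the cited method.
    """
    low = max(range_a[0], range_b[0])
    high = min(range_a[1], range_b[1])
    intersection = max(0.0, high - low)
    shorter = min(range_a[1] - range_a[0], range_b[1] - range_b[0])
    if shorter == 0:
        # A point range overlaps fully if it lies inside the other range.
        return 1.0 if low <= high else 0.0
    return intersection / shorter


# Hypothetical ranges for one transcript in two conditions.
condition_1 = (10.0, 20.0)
condition_2 = (15.0, 30.0)
print(range_overlap_fraction(condition_1, condition_2))
```

A value near 1 would mean the two conditions cannot be separated on the basis of their equally optimal estimates, while a value of 0 means the ranges are disjoint.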


“…In addition, external information about sequence similarity provides limited insight on how to improve the quantification models. Soneson et al [23] use a compatibility score of observed and predicted junction coverage to indicate genes with potential misquantification in its transcripts. With this anomaly score, it is possible to narrow down the misquantified transcripts by the anomalous splicing junctions.…”

Section: Introduction (mentioning)

confidence: 99%

Detecting anomalies in RNA-seq quantification

Kingsford 2019 (preprint)

Algorithms to infer isoform expression abundance from RNA-seq have been greatly improved in accuracy during the past ten years. However, due to incomplete reference transcriptomes, mapping errors, incomplete sequencing bias models, or mistakes made by the algorithm, the quantification model sometimes could not explain all aspects of the input read data, and misquantification can occur. Here, we develop a computational method to detect instances where a quantification model could not thoroughly explain the input. Specifically, our approach identifies transcripts where the read coverage has significant deviations from the expectation. We call these transcripts "expression anomalies", and they represent instances where the quantification estimates may be in doubt. We further develop a method to attribute the cause of anomalies to either the incompleteness of the reference transcriptome or the algorithmic mistakes, and we show that our method precisely detects misquantifications with both causes. By correcting the misquantifications that are labeled as algorithmic mistakes, the number of false predictions of differentially expressed transcripts can be reduced. Applying anomaly detection to 30 GEUVADIS and 16 Human Body Map samples, we detect 103 genes with potential unannotated isoforms. These genes tend to be longer than average, and contain a very long exon near 3′ end that the unannotated isoform excludes. Anomaly detection is a new approach for investigating the expression quantification problem that may find wider use in other areas of genomics.

While modern RNA-seq quantification algorithms [e.g. 1-7] often achieve high accuracy, there remain situations where they give erroneous quantifications. For example, most quantifiers rely on a predetermined set of possible transcripts; missing or incorrect transcripts may cause incorrect quantifications. Read mapping mistakes and unexpected sequencing artifacts introducing technical biases also lead to misquantifications. Incomplete sequencing bias models can mislead the probability calculation of which transcripts generate the reads. Quantification algorithms themselves could introduce errors since their objectives cannot typically be guaranteed to be solved optimally in a practical amount of time.

When interpreting an expression experiment, particularly when a few specific genes are of interest, the possibility of misquantification must be taken into account before inferences are made from quantification estimations or differential gene expression predictions derived from those quantifications. Expression quantification is the basis for various analyses, such as differential gene expression [8], co-expression inference [9], disease diagnosis and various computational prediction tasks [e.g., 10-12]. Statistical techniques such as bootstrapping [13] and Gibbs sampling [1, 14, 15] can associate confidence intervals to expression estimates, but these techniques provide little ...


A junction coverage compatibility score to quantify the reliability of transcript abundance estimates and annotation catalogs

Abstract: Comparison of observed exon–exon junction counts to those predicted from estimated transcript abundances can identify genes with misannotated or misquantified transcripts.
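The comparison described in the abstract can be sketched as follows. This is a simplified illustration, not the published JCC definition: the function name, the rescaling of predicted coverage to the observed total, and the division by twice the observed total (which bounds the score to [0, 1]) are all assumptions made here. The predicted junction coverages are taken as given; deriving them from transcript abundance estimates is outside this sketch.

```python
def junction_mismatch_score(observed, predicted):
    """Normalized absolute difference between observed and predicted
    junction coverage for one gene.

    observed, predicted: dicts mapping a junction identifier to a
    coverage value. Returns None when either total is zero, because
    the predicted values cannot be rescaled in that case.
    """
    total_observed = sum(observed.values())
    total_predicted = sum(predicted.values())
    if total_observed == 0 or total_predicted == 0:
        return None
    # Rescale predictions so both profiles have the same total coverage
    # (an assumption of this sketch).
    scale = total_observed / total_predicted
    junctions = set(observed) | set(predicted)
    difference = sum(
        abs(predicted.get(j, 0.0) * scale - observed.get(j, 0.0))
        for j in junctions
    )
    # Dividing by twice the observed total keeps the score in [0, 1].
    return difference / (2 * total_observed)


# Hypothetical junction counts for a gene with two junctions.
observed = {"exon1-exon2": 80, "exon2-exon3": 20}
predicted = {"exon1-exon2": 50, "exon2-exon3": 50}
print(junction_mismatch_score(observed, predicted))
```

Under these assumptions, a score of 0 means the predicted junction profile matches the observed one after rescaling, and larger values flag genes whose transcripts may be misannotated or misquantified.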
