Algorithms to infer isoform expression abundance from RNA-seq have greatly improved in accuracy over the past ten years. However, due to incomplete reference transcriptomes, mapping errors, incomplete sequencing bias models, or mistakes made by the algorithm, the quantification model sometimes cannot explain all aspects of the input read data, and misquantification can occur. Here, we develop a computational method to detect instances where a quantification model cannot thoroughly explain the input. Specifically, our approach identifies transcripts whose read coverage deviates significantly from the expectation. We call these transcripts "expression anomalies", and they represent instances where the quantification estimates may be in doubt. We further develop a method to attribute the cause of anomalies to either the incompleteness of the reference transcriptome or algorithmic mistakes, and we show that our method precisely detects misquantifications with both causes. By correcting the misquantifications that are labeled as algorithmic mistakes, the number of false predictions of differentially expressed transcripts can be reduced. Applying anomaly detection to 30 GEUVADIS and 16 Human Body Map samples, we detect 103 genes with potential unannotated isoforms. These genes tend to be longer than average and contain a very long exon near the 3′ end that the unannotated isoform excludes. Anomaly detection is a new approach for investigating the expression quantification problem that may find wider use in other areas of genomics.

While modern RNA-seq quantification algorithms [e.g., 1-7] often achieve high accuracy, there remain situations where they give erroneous quantifications. For example, most quantifiers rely on a predetermined set of possible transcripts; missing or incorrect transcripts may cause incorrect quantifications.
Read mapping mistakes and unexpected sequencing artifacts introducing technical biases also lead to misquantifications.
Incomplete sequencing bias models can distort the calculated probabilities of which transcripts generated the reads. Quantification algorithms themselves can introduce errors, since their objectives typically cannot be guaranteed to be solved optimally in a practical amount of time.
When interpreting an expression experiment, particularly when a few specific genes are of interest, the possibility of misquantification must be taken into account before inferences are made from quantification estimates or from differential gene expression predictions derived from those quantifications. Expression quantification is the basis for various analyses, such as differential gene expression [8], co-expression inference [9], disease diagnosis and various computational prediction tasks [e.g., 10-12]. Statistical techniques such as bootstrapping [13] and Gibbs sampling [1, 14, 15] can associate confidence intervals with expression estimates, but these techniques provide little ...