High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in well-annotated mammalian species. The advances in sequencing technology have created a need for studies and tools that can characterize these novel variants. Here, we present SQANTI, an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline using 47 unique descriptors. We apply SQANTI to a neuronal mouse transcriptome using Pacific Biosciences (PacBio) long reads and illustrate how the tool is effective in characterizing and describing the composition of the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that a substantial number of the novel transcripts are technical artifacts of the sequencing approach and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, resulting more frequently in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact on the correct quantification of transcript levels by state-of-the-art short-read-based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomics detection. SQANTI allows the user to maximize the analytical outcome of long-read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes.
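To make the idea of "novel combinations of existing splice sites" concrete, the following is a minimal sketch of how a long-read transcript could be compared against a reference annotation by its splice junctions. It is an illustration under stated assumptions, not SQANTI's implementation: the function name `classify_transcript`, the junction representation, and the four labels are hypothetical, strand and chromosome are ignored, and partial matches to a reference transcript are not treated separately.

```python
from typing import List, Tuple

# A junction is (donor_position, acceptor_position) on a single
# chromosome and strand. This representation is an assumption
# made for the sketch.
Junction = Tuple[int, int]


def classify_transcript(
    junctions: List[Junction],
    reference_chains: List[List[Junction]],
) -> str:
    """Return a coarse, hypothetical label for one transcript.

    The labels are illustrative only and are not SQANTI's own
    category names.
    """
    if not junctions:
        # No introns, so there is no junction chain to compare.
        return "mono_exonic"

    chain = tuple(junctions)
    ref_chains = [tuple(c) for c in reference_chains]

    # Exact match of the full junction chain to an annotated transcript.
    if chain in ref_chains:
        return "known"

    # Collect every annotated donor and acceptor site.
    known_donors = {d for c in ref_chains for d, _ in c}
    known_acceptors = {a for c in ref_chains for _, a in c}

    # Every site is annotated, but this chain of junctions is not.
    if all(d in known_donors and a in known_acceptors for d, a in chain):
        return "novel_combination_of_known_sites"

    # At least one donor or acceptor is absent from the annotation.
    return "novel_splice_site"


# Toy reference: two annotated transcripts sharing their second junction.
reference = [
    [(100, 200), (300, 400)],
    [(100, 250), (300, 400)],
]

print(classify_transcript([(100, 200), (300, 400)], reference))
print(classify_transcript([(100, 400)], reference))
print(classify_transcript([(100, 220), (300, 400)], reference))
```

In this toy reference, the second call joins an annotated donor (100) to an annotated acceptor (400) that are never paired in the annotation, which is the kind of transcript the abstract describes as a novel combination of existing splice sites.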
High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in very well annotated organisms such as mice and humans. Nonetheless, there is a need for studies and tools that characterize these novel isoforms. Here we present SQANTI, an automated pipeline for the classification of long-read transcripts that computes 47 descriptors that can be used to assess the quality of the data and of the preprocessing pipelines. We applied SQANTI to a neuronal mouse transcriptome using PacBio long reads and illustrate how the tool is effective in readily describing the composition of and characterizing the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that an important number of the novel transcripts are technical artifacts of the sequencing approach, and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, result more frequently in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact on the correct quantification of transcript levels by state-of-the-art short-read-based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases we find that alternative isoforms
Recent advances in long-read sequencing solve inaccuracies in alternative transcript identification of full-length transcripts in short-read RNA-Seq data, which encourages the development of methods for isoform-centered functional analysis. Here, we present tappAS, the first framework to enable a comprehensive Functional Iso-Transcriptomics (FIT) analysis, which is effective at revealing the functional impact of context-specific post-transcriptional regulation. tappAS uses isoform-resolved annotation of coding and non-coding functional domains, motifs, and sites, in combination with novel analysis methods to interrogate different aspects of the functional readout of transcript variants and isoform regulation. tappAS software and documentation are available at https://app.tappas.org.