Although mass spectrometry is well-suited to identifying thousands of possible protein post-translational modifications (PTMs), it has historically been biased towards just a few. To measure the entire set of PTMs across diverse proteomes, software must overcome the dual challenges of exploring enormous search spaces and distinguishing correct from incorrect spectrum interpretations. Here, we describe TagGraph, a computational tool that overcomes both challenges with an unrestricted string-based search method that is as much as 350-fold faster than existing approaches, and a probabilistic validation model we optimized for PTM assignments. We applied TagGraph to a published human proteomic data set of 25 million mass spectra and tripled confident spectrum identifications compared with its original analysis. We identified thousands of modification types on almost one million sites in the proteome. We show new contexts for highly abundant yet understudied PTMs such as proline hydroxylation, and its unexpected association with cancer mutations. By enabling broad PTM characterization, TagGraph informs how PTM functions and regulation intersect.
Dependent on concise, predefined protein sequence databases, traditional search algorithms perform poorly when analyzing mass spectra derived from wholly uncharacterized protein products. Conversely, de novo peptide sequencing algorithms can interpret mass spectra without relying on reference databases. However, such algorithms have been difficult to apply to complex protein mixtures, in part due to a lack of methods for automatically validating de novo sequencing results. Here, we present novel metrics for benchmarking de novo sequencing algorithm performance on large-scale proteomics data sets and present a method for accurately calibrating false discovery rates on de novo results. We also present a novel algorithm (LADS) that leverages experimentally disambiguated fragmentation spectra to boost sequencing accuracy and sensitivity. LADS improves sequencing accuracy on longer peptides relative to that of other algorithms and improves discriminability of correct and incorrect sequences. Using these advancements, we demonstrate accurate de novo identification of peptide sequences not identifiable using database search-based approaches.
Thousands of protein post-translational modifications (PTMs) dynamically impact nearly all cellular functions. Mass spectrometry is well suited to PTM identification, but proteome-scale analyses are biased towards PTMs with existing enrichment methods. To measure the full landscape of PTM regulation, software must overcome two fundamental challenges: intractably large search spaces and difficulty distinguishing correct from incorrect identifications. Here, we describe TagGraph, software that overcomes both challenges with a string-based search method orders of magnitude faster than current approaches, and a probabilistic validation model optimized for PTM assignments. When applied to a human proteome map, TagGraph tripled confident identifications while revealing thousands of modification types on nearly one million sites spanning the proteome. We expand known sites by orders of magnitude for highly abundant yet understudied PTMs such as proline hydroxylation, and derive tissue-specific insight into these PTMs' roles. TagGraph expands our ability to survey the full landscape of PTM function and regulation.
Conventional sequence database search tools cannot identify modified peptides unless they are first anticipated by the researcher [20-22]. Search parameters including the number, kind, and frequency of PTMs are usually chosen to strike a difficult compromise: considering larger numbers of PTMs and other sequence variants is necessary for their identification, but doing so exponentially increases the time needed to interpret MS/MS datasets, and decreases the ability to distinguish correct from incorrect assignments [23]. To partially address this compromise, strategies have been proposed to constrain the number of proteins being searched, protease specificity rules, or the allowable types and numbers of PTMs [17,18,24-26].
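The growth in search space described above can be made concrete with a small counting sketch. It is illustrative only and rests on a deliberately crude assumption: every residue of a peptide may stay unmodified or carry any one of the modification types under consideration, which overstates real residue specificity. The helper name `candidate_forms` and its parameters are hypothetical and do not come from any search tool.

```python
from math import comb


def candidate_forms(n_sites: int, n_mod_types: int, max_mods: int) -> int:
    """Count the distinct modified forms of one peptide (hypothetical helper).

    Assumption: each of n_sites residues may be unmodified or carry any one
    of n_mod_types modifications, with at most max_mods modified residues.
    """
    limit = min(max_mods, n_sites)
    # Choose which j residues are modified, then assign one type to each.
    return sum(comb(n_sites, j) * n_mod_types ** j for j in range(limit + 1))


# A 10-residue peptide searched with progressively longer modification lists,
# allowing at most two modified residues per peptide.
for n_mod_types in (1, 3, 10, 100):
    print(n_mod_types, candidate_forms(10, n_mod_types, max_mods=2))
```

Under these assumptions a 10-residue peptide with at most two modified residues has 1 + 10m + 45m^2 candidate forms for m modification types, so three types already give 436 forms. Removing the cap on modified residues gives (m + 1)^10, which is the exponential growth that restricted searches try to avoid.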
In practice, these approaches only marginally decrease search times without clearly distinguishing correct from incorrect PTM assignments [27]. Therefore, most have not been demonstrated on large, proteome-scale datasets [23].
Here, we describe TagGraph, a powerful computational tool that addresses two principal challenges of searching very large sequence spaces. First, TagGraph leverages accurate de novo mass spectrum interpretations [28,29] to rapidly search millions of possible sequences for a match with an FM-index [30] data structure. This highly efficient search method makes modern next-generation genome sequencing possible [31], but has not been adapted to proteomics. By combining it with a graph-based string reconciliation algorithm, TagGraph rapidly searches MS/MS datasets without restrictions on the number of proteins, PTMs, or protease specificity. This strategy achieves speeds orders of magnitude faster than prior algorithms because it considers exponentially more sequence possibilities without having to explicitly test each one against input spectra. Second, by replacing conventional "ta...
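To illustrate the kind of lookup an FM-index supports, the following is a minimal sketch of exact substring search by backward search over a Burrows-Wheeler transform. It is a toy, not TagGraph's implementation: the function names `build_fm_index` and `find` are hypothetical, construction is naive (sorting whole suffixes), and the occurrence table is stored uncompressed, whereas practical indexes use sampled suffix arrays and compact rank structures. The graph-based reconciliation step is not shown.

```python
def build_fm_index(text: str):
    """Build a toy FM-index (suffix array, C table, occurrence table)."""
    text += "$"  # unique sentinel that sorts before uppercase residue letters
    # Naive suffix array: fine for a demo, far too slow for a real proteome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" at i = 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c]: number of characters in the text that sort strictly before c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def find(index, pattern: str) -> list:
    """Return sorted start positions of pattern, via backward search."""
    sa, C, occ = index
    lo, hi = 0, len(sa)  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


# Hypothetical protein sequence and short sequence tags as queries.
index = build_fm_index("MKTAYIAKQRQISFVKSHFSRQ")
print(find(index, "QISF"), find(index, "RQ"), find(index, "WWW"))
```

Each query costs one table lookup per pattern character, independent of how long the indexed text is, which is what makes this family of indexes attractive for matching many short tags against a large sequence collection.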