The Probability Distribution for a Random Match Between an Experimental-Theoretical Spectral Pair in Tandem Mass Spectrometry

Fridman, Tema; Razumovskaya, Jane; VerBerkmoes, Nathan C.; Hurst, Gregory B.; Protopopescu, V.; Xu, Ying

doi:10.1142/s0219720005001120

Cited by 16 publications

(19 citation statements)

References 26 publications


“…In peptide identification research, database search techniques have been commonly used to select a candidate set of peptides based on the degree of matching between the "theoretical" (expected) mass spectra of candidate peptides in a protein database and the empirical spectra in the input sample [1], [2], [4], [5], [6], [7], [8]. The theoretical spectrum of each peptide can be automatically derived by rules from the amino acid sequences of proteins.…”

Section: Background and Related Work (mentioning)

confidence: 99%
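The statement above says a theoretical spectrum can be derived by rules from an amino acid sequence. A minimal sketch of one such rule set follows, assuming singly charged b and y ions only and standard monoisotopic residue masses; it is not the scoring model of the cited paper or of any named search tool. The function names `theoretical_spectrum` and `shared_peak_count` and the 0.5 Da tolerance are illustrative assumptions.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O
PROTON = 1.007276   # mass of one proton (charge carrier)


def theoretical_spectrum(peptide):
    """Return (label, m/z) pairs for singly charged b and y ions, sorted by m/z.

    Assumption: only b/y fragments, charge +1, no modifications or neutral losses.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    n = len(masses)
    ions = []
    prefix = 0.0
    for i in range(1, n):
        prefix += masses[i - 1]
        # b_i: N-terminal fragment of the first i residues plus a proton.
        ions.append((f"b{i}", prefix + PROTON))
        # y_(n-i): complementary C-terminal fragment plus water and a proton.
        ions.append((f"y{n - i}", total - prefix + WATER + PROTON))
    return sorted(ions, key=lambda ion: ion[1])


def shared_peak_count(observed_mz, peptide, tol=0.5):
    """Count theoretical peaks that have an observed peak within `tol` Da.

    A deliberately simple degree-of-matching measure, used here only
    to illustrate comparing an experimental and a theoretical spectrum.
    """
    count = 0
    for _, mz in theoretical_spectrum(peptide):
        if any(abs(mz - obs) <= tol for obs in observed_mz):
            count += 1
    return count


if __name__ == "__main__":
    for label, mz in theoretical_spectrum("PEPTIDE"):
        print(f"{label}\t{mz:.4f}")
```

A candidate peptide from a database would be scored against each experimental spectrum with a measure like `shared_peak_count`; production tools use more elaborate scores, and the random-match probability studied in the indexed paper concerns how often such matches arise by chance.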


Protein Identification from Tandem Mass Spectra with Probabilistic Language Modeling

Yang

Harpale

Ganapathy

2009

Machine Learning and Knowledge Discovery in Databases


Abstract. This paper presents an interdisciplinary investigation of statistical information retrieval (IR) techniques for protein identification from tandem mass spectra, a challenging problem in proteomic data analysis. We formulate the task as an IR problem, by constructing a "query vector" whose elements are system-predicted peptides with confidence scores based on spectrum analysis of the input sample, and by defining the vector space of "documents" with protein profiles, each of which is constructed based on the theoretical spectrum of a protein. This formulation establishes a new connection from the protein identification problem to a probabilistic language modeling approach as well as the vector space models in IR, and enables us to compare fundamental differences in the IR models and common approaches in protein identification. Our experiments on benchmark spectrometry query sets and large protein databases demonstrate that the IR models significantly outperform well-established methods in protein identification, by enhancing precision in high-recall regions in particular, where the conventional approaches are weak.



“…Many technical solutions have been developed for the peptide identification step in the past two decades, including commercially available software [1], [2], [4], [5], [6], [7], [8]. However, for the second step, the current literature is relatively sparse.…”

Section: Introduction (mentioning)

confidence: 99%

Protein Identification from Tandem Mass Spectra with Probabilistic Language Modeling

Yang

Harpale

Ganapathy

2009

Machine Learning and Knowledge Discovery in Databases


“…The data generated from these machines is stochastic in nature and complex algorithms are required for post-processing of the raw data, e.g., phosphopeptide filtering [14], false positive rate estimation [4], quantification of proteins from large data sets [13], and phosphorylation site assignments [24], [26]. Other advanced methods include techniques to discriminate between different ions [32], estimating the probabilities of random match between an experimental-theoretical spectral pair [11], and identification of specific protein interactions using MS data [20]. As more high-throughput mass spectrometers are introduced, more efficient and novel computational tools are required to deal with these large data sets.…”

Section: Introduction (mentioning)

confidence: 99%

CAMS-RS: Clustering Algorithm for Large-Scale Mass Spectrometry Data Using Restricted Search Space and Intelligent Random Sampling

Saeed

Hoffert

Knepper

2014

IEEE/ACM Trans. Comput. Biol. and Bioinf.


High-throughput mass spectrometers can produce massive amounts of redundant data at an astonishing rate with many of them having poor signal-to-noise (S/N) ratio. These low S/N ratio spectra may not get interpreted using conventional spectra-to-database matching techniques. In this paper, we present an efficient algorithm, CAMS-RS (Clustering Algorithm for Mass Spectra using Restricted Space and Sampling) for clustering of raw mass spectrometry data. CAMS-RS utilizes a novel metric (called F-set) that exploits the temporal and spatial patterns to accurately assess similarity between two given spectra. The F-set similarity metric is independent of the retention time and allows clustering of mass spectrometry data from independent LC-MS/MS runs. A novel restricted search space strategy is devised to limit the comparisons of the number of spectra. An intelligent sampling method is executed on individual bins that allow merging of the results to make the final clusters. Our experiments, using experimentally generated data sets, show that the proposed algorithm is able to cluster spectra with high accuracy and is helpful in interpreting low S/N ratio spectra. The CAMS-RS algorithm is highly scalable with increasing number of spectra and our implementation allows clustering of up to a million spectra within minutes.


“…In this approach, experimental MS/MS spectra are annotated by theoretically derived spectra predicted by peptides contained in a protein sequence database. Several database search tools are available, including SEQUEST (14), MASCOT (15), X!TANDEM (16), and others (17)(18)(19)(20). A current challenge for high-throughput proteomics is to use database search results from large numbers of MS/MS spectra to derive a list of identified peptides and their corresponding proteins.…”

mentioning

confidence: 99%

A Multivariate Mixture Model to Estimate the Accuracy of Glycosaminoglycan Identifications Made by Tandem Mass Spectrometry (MS/MS) and Database Search

Chiu

Schliekelman

Orlando

et al. 2017

Molecular & Cellular Proteomics


We present a statistical model to estimate the accuracy of derivatized heparin and heparan sulfate (HS) glycosaminoglycan (GAG) assignments to tandem mass (MS/MS) spectra made by the first published database search application, GAG-ID. Employing a multivariate expectation-maximization algorithm, this statistical model distinguishes correct from ambiguous and incorrect database search results when computing the probability that heparin/HS GAG assignments to spectra are correct based upon database search scores. Using GAG-ID search results for spectra generated from a defined mixture of 21 synthesized tetrasaccharide sequences as well as seven spectra of longer defined oligosaccharides, we demonstrate that the computed probabilities are accurate and have high power to discriminate between correctly, ambiguously, and incorrectly assigned heparin/HS GAGs. Heparin and heparan sulfate (HS), members of the glycosaminoglycan (GAG) family, are linear polysaccharides composed of repeating disaccharide building blocks of variously sulfated hexuronic acid (1→4) D-glucosamine units that structurally differ solely by the length of the oligosaccharide and degree of modification, with heparin being more heavily sulfated and having less N-acetylation than HS. Interacting with proteins, heparin/HS play essential roles in a wide variety of biological processes, including anticoagulation (1), cell proliferation (2, 3), and carcinogenesis (4, 5). The specificity of these interactions is driven by the pattern of modification of heparin/HS oligosaccharide sequences. To understand the molecular role of heparin/HS, it is necessary to correlate function with the fine structure of the carbohydrate. However, the non-template-driven biosynthesis of heparin/HS results in extremely diverse structures.
Analyzing heparin/HS is challenging for three reasons: the presence of multiple isomeric sequences in a complex mixture of oligosaccharides, the difficulty of separating the isomers, and the facile loss of sulfates in MS/MS (6). We previously introduced a method for structurally sequencing heparin/HS oligosaccharides that involves chemical derivatizations to replace labile sulfates with stable acetyl groups (7). This derivatization scheme allows for the use of reverse-phase liquid chromatography (LC) for high-resolution separation of isomeric heparin/HS oligosaccharides and MS/MS for sequencing them. However, the data from this derivatization method cannot be easily incorporated into current glycomic software, such as GlycoWorkbench (8), due to the multistep derivatizations and lack of a scoring algorithm that accurately evaluates the matches. We recently reported the development of a software tool for the high-throughput analysis of LC-MS/MS data from these derivatized heparin/HS oligosaccharides, entitled GAG-ID (9), which is the first database-driven software package for this purpose. GAG-ID produces a GAG sequence assignment for each input spectrum; however, some assignments are true matches and some are false. False matches arise from low-qua...

