Proteomic analysis of complex protein mixtures using proteolytic digestion and liquid chromatography in combination with tandem mass spectrometry is a standard approach in biological studies. Data-dependent acquisition is used to automatically acquire tandem mass spectra of peptides eluting into the mass spectrometer. In more complicated mixtures, for example, whole cell lysates, data-dependent acquisition samples only a subset of the peptide ions present rather than acquiring tandem mass spectra for all ions available. We analyzed the sampling process and developed a statistical model to accurately predict the level of sampling expected for mixtures of a specific complexity. The model also predicts how many analyses are required for saturated sampling of a complex protein mixture. For a soluble yeast cell lysate, our model predicts that 10 analyses are required to reach a 95% saturation level for protein identifications. The statistical model also suggests a relationship between the level of sampling observed for a protein and the relative abundance of the protein in the mixture. We demonstrate a linear dynamic range over 2 orders of magnitude by using the number of spectra (spectral sampling) acquired for each protein.
Large-scale genomics has enabled proteomics by creating sequence infrastructures that can be used with mass spectrometry data to identify proteins. Although protein sequences can be deduced from nucleotide sequences, posttranslational modifications to proteins, in general, cannot. We describe a process for the analysis of posttranslational modifications that is simple, robust, general, and can be applied to complicated protein mixtures. A protein or protein mixture is digested by using three different enzymes: one that cleaves in a site-specific manner and two others that cleave nonspecifically. The mixture of peptides is separated by multidimensional liquid chromatography and analyzed by a tandem mass spectrometer. This approach has been applied to modification analyses of proteins in a simple protein mixture, Cdc2p protein complexes isolated through the use of an affinity tag, and lens tissue from a patient with congenital cataracts. Phosphorylation sites have been detected at known stoichiometries as low as 10%. Eighteen sites of four different types of modification have been detected on three of the five proteins in a simple mixture, three of which were previously unreported. Three proteins from Cdc2p isolated complexes yielded eight sites containing three different types of modifications. In the lens tissue, 270 proteins were identified, and 11 different crystallins were found to contain a total of 73 sites of modification. Modifications identified in the crystallin proteins included Ser, Thr, and Tyr phosphorylation, Arg and Lys methylation, Lys acetylation, and Met, Tyr, and Trp oxidations. The method presented will be useful in discovering co- and posttranslational modifications of proteins.
As the speed with which proteomic labs generate data increases along with the scale of projects they are undertaking, the resulting data storage and data processing problems will continue to challenge computational resources. This is especially true for shotgun proteomic techniques that can generate tens of thousands of spectra per instrument each day. One design factor leading to many of these problems is the storage of spectra, and of the database identifications for a given spectrum, as individual files. While these problems can be addressed by storing all of the spectra and search results in large relational databases, the infrastructure to implement such a strategy can be beyond the means of academic labs. We report here a series of unified text file formats for storing spectral data (MS1 and MS2) and search results (SQT) that are compact, easily parsed by both machines and humans, and yet flexible enough to be coupled with new algorithms and data-mining strategies.
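To illustrate why a single line-oriented text file is easy to parse, here is a minimal sketch of a reader for an MS2-style file. The abstract above does not spell out the record layout, so the layout used here is an assumption: `H` lines for headers, `S` lines for a spectrum (first scan, last scan, precursor m/z), `Z` lines for a candidate charge and (M + H)+ value, and bare `m/z intensity` pairs for peaks. The names `Spectrum` and `parse_ms2` are hypothetical, and the sample data is invented.

```python
from dataclasses import dataclass, field


@dataclass
class Spectrum:
    first_scan: int
    last_scan: int
    precursor_mz: float
    charges: list = field(default_factory=list)  # (charge, M+H) pairs from Z lines
    peaks: list = field(default_factory=list)    # (m/z, intensity) pairs


def parse_ms2(text):
    """Parse MS2-style text into (headers, spectra). Layout is assumed, see lead-in."""
    headers, spectra, current = [], [], None
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        tag = parts[0]
        if tag == "H":
            headers.append(" ".join(parts[1:]))
        elif tag == "S":
            # A new spectrum record starts here.
            current = Spectrum(int(parts[1]), int(parts[2]), float(parts[3]))
            spectra.append(current)
        elif tag == "Z" and current is not None:
            current.charges.append((int(parts[1]), float(parts[2])))
        elif tag in ("I", "D"):
            continue  # per-spectrum annotation lines are ignored in this sketch
        elif current is not None:
            current.peaks.append((float(parts[0]), float(parts[1])))
    return headers, spectra


# Invented example data, for illustration only.
EXAMPLE = (
    "H\tCreationDate\t2004-01-01\n"
    "S\t15\t15\t650.32\n"
    "Z\t2\t1299.63\n"
    "Z\t3\t1948.95\n"
    "187.4 12.0\n"
    "245.1 30.5\n"
    "S\t16\t16\t421.76\n"
    "Z\t1\t421.76\n"
    "110.1 8.2\n"
)

headers, spectra = parse_ms2(EXAMPLE)
```

Because every spectrum lives in one file and each record type is identified by its first token, a reader like this needs no index or database, which is the practical point the abstract makes.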
Database searching is an essential element of large-scale proteomics. Because these methods are widely used, it is important to understand the rationale of the algorithms. Most algorithms are based on concepts first developed in SEQUEST and PeptideSearch. Four basic approaches are used to determine a match between a spectrum and sequence: descriptive, interpretative, stochastic and probability-based matching. We review the basic concepts used by most search algorithms, the computational modeling of peptide identification and current challenges and limitations of this approach for protein identification.
An unintended consequence of whole-genome sequencing has been the birth of large-scale proteomics. What drives proteomics is the ability to use mass spectrometry data of peptides as an 'address' or 'zip code' to locate proteins in sequence databases. Two mass spectrometry methods are used to identify proteins by database search methods. The first method uses a molecular weight fingerprint measured from a protein digested with a site-specific protease [1][2][3][4][5]. A second method uses tandem mass spectra derived from individual peptides of a digested protein [6][7] (Fig. 1). Because each tandem mass spectrum represents an independent and verifiable piece of data, this approach to database searching has the ability to identify proteins in mixtures, enabling a rapid and comprehensive approach for the analysis of protein complexes and other complicated mixtures of proteins [6][8][9][10][11][12]. New biology has been discovered based on fast and accurate protein identification [13][14][15][16][17][18].
As tandem mass spectral protein identification has proliferated, it has become increasingly important to understand the rationale of individual database search algorithms, their relative strengths and weaknesses, and the mathematics used to match sequence to spectrum.
In this review we discuss the prevailing fragmentation models, spectral preprocessing, methods to match tandem mass spectra to sequences and several approaches to matching tandem mass spectra of peptides whose exact sequences may not be present in the database. Space limitations restrict a detailed description of all algorithms in this rapidly expanding field. Also, some algorithms are proprietary, and thus, details on how they work are unknown. This review should supplement and update earlier reviews on database search algorithms [19][20][21][22][23][24].
Peptide fragmentation and data preprocessing
In tandem mass spectrometry (MS/MS), gas phase peptide ions undergo collision-induced dissociation (CID) with molecules of an inert gas such as helium or argon [25]. Other methods of dissociation have been developed, such as electron capture dissociation (ECD), surface induced dissociation (SID) and electron transfer dissociation (ETD), but gas-phase CID is the most widely used in commercial tandem mass spectrometers. The dissociation pathways are strongly dependent on the collision energy, but the vast majority of instruments use low-energy CID (<100 eV) [26] ....
Quantitative shotgun proteomic analyses are facilitated using chemical tags such as ICAT and metabolic labeling strategies with stable isotopes. The rapid high-throughput production of quantitative "shotgun" proteomic data necessitates the development of software to automatically convert mass spectrometry-derived data of peptides into relative protein abundances. We describe a computer program called RelEx, which uses a least-squares regression for the calculation of the peptide ion current ratios from the mass spectrometry-derived ion chromatograms. RelEx is tolerant of poor signal-to-noise data and can automatically discard nonusable chromatograms and outlier ratios. We apply a simple correction for systematic errors that improves the accuracy of the quantitative measurement by 32 +/- 4%. Our automated approach was validated using labeled mixtures composed of known molar ratios and demonstrated in a real sample by measuring the effect of osmotic stress on protein expression in Saccharomyces cerevisiae.
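The regression idea in the abstract above can be sketched as follows: if the labeled and unlabeled forms of a peptide co-elute, their ion chromatogram intensities should be proportional point by point, so the slope of a least-squares line through the paired intensities estimates the abundance ratio, and the correlation coefficient indicates whether the chromatogram pair is usable. This is a minimal illustration under those assumptions, not RelEx's actual code; the function name `regression_ratio`, the intercept term and the sample intensities are all hypothetical.

```python
import math


def regression_ratio(light, heavy):
    """Least-squares slope (heavy vs. light) and Pearson r for paired intensities."""
    n = len(light)
    if n != len(heavy) or n < 3:
        raise ValueError("need two equal-length chromatograms with >= 3 points")
    mean_x = sum(light) / n
    mean_y = sum(heavy) / n
    sxx = sum((x - mean_x) ** 2 for x in light)
    syy = sum((y - mean_y) ** 2 for y in heavy)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(light, heavy))
    if sxx == 0 or syy == 0:
        raise ValueError("flat chromatogram; ratio is undefined")
    slope = sxy / sxx                 # estimated heavy/light ratio
    r = sxy / math.sqrt(sxx * syy)    # quality measure for this chromatogram pair
    return slope, r


# Invented intensities: the heavy trace is twice the light trace plus a
# constant background offset of 5, which the intercept absorbs.
light = [0.0, 10.0, 40.0, 100.0, 40.0, 10.0, 0.0]
heavy = [2.0 * x + 5.0 for x in light]
ratio, r = regression_ratio(light, heavy)
```

Using the slope rather than a ratio of summed peak areas makes the estimate insensitive to a constant background, and a low `r` gives a simple, hypothetical criterion for discarding a noisy pair.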
We report the results of our work to facilitate protein identification using tandem mass spectra and protein sequence databases. We describe a parallel version of SEQUEST (SEQUEST-PVM) that is tolerant toward arithmetic exceptions. The changes we report effectively separate search processes on slave nodes from each other. Therefore, if one of the slave nodes drops out of the cluster due to an error, the rest of the cluster will carry the search process to the end. SEQUEST has been widely used for protein identifications. The modifications made to the code improve its stability and effectiveness in a high-throughput production environment. We evaluate the overhead associated with the parallelization of SEQUEST. A prior version of software to preprocess LC/MS/MS data attempted to differentiate the charge states of ions. Singly charged ions can be accurately identified, but the software was unable to reliably differentiate tandem mass spectra of +2 and +3 charge states. We have designed and implemented a computational approach to narrow charge states of precursor ions from nominal resolution ion-trap tandem mass spectra. The preprocessing code, 2to3, determines the charge state of the precursor ion using its mass-to-charge ratio (m/z) and fragment ions contained in the tandem mass spectrum. For each possible charge state the program calculates the expected fragment ions that account for precursor ion m/z values. If either of the numbers is less than an empirically determined threshold value, then the spectrum corresponding to that charge state is removed. If both numbers are higher than the threshold value, then +2 and +3 copies of the spectrum are kept. We present the comparison of results from protein identification experiments with and without using 2to3. It is shown that by determining the charge state and eliminating poor quality spectra, 2to3 decreases the number of spectral files to be searched without affecting the search results. The decrease reduces computer requirements and researcher efforts for analysis of the results.
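One way to read the charge-state test described above is sketched here. The abstract does not give the exact counting rule, so this sketch assumes that "fragment ions that account for the precursor" means pairs of singly charged fragment peaks whose m/z values sum to the precursor's (M + H)+ plus one proton mass, as complementary b and y ions do. The function names, the tolerance and the threshold are hypothetical, and the peak list is invented.

```python
PROTON = 1.00728  # proton mass in Da


def mh_from_mz(mz, z):
    """Singly protonated mass (M + H)+ implied by precursor m/z at charge z."""
    return mz * z - (z - 1) * PROTON


def count_complementary_pairs(peaks, mh, tol=1.0):
    """Count peak pairs whose m/z values sum to (M + H)+ plus a proton, within tol.

    Uses a simple two-pointer sweep over the sorted peaks, which is a
    simplification: each peak is paired at most once.
    """
    target = mh + PROTON
    mzs = sorted(peaks)
    i, j, count = 0, len(mzs) - 1, 0
    while i < j:
        total = mzs[i] + mzs[j]
        if abs(total - target) <= tol:
            count += 1
            i += 1
            j -= 1
        elif total < target:
            i += 1
        else:
            j -= 1
    return count


def candidate_charges(precursor_mz, peaks, threshold=2, tol=1.0):
    """Return the charge states (+2, +3) whose pair count reaches the threshold."""
    counts = {
        z: count_complementary_pairs(peaks, mh_from_mz(precursor_mz, z), tol)
        for z in (2, 3)
    }
    kept = [z for z, c in counts.items() if c >= threshold]
    return kept, counts


# Invented spectrum built so that three pairs sum to 1001.00728,
# i.e. (M + H)+ = 1000.0 for a +2 precursor at m/z 500.50364.
peaks = [150.1, 850.90728, 300.2, 700.80728, 450.0, 551.00728]
kept, counts = candidate_charges(500.50364, peaks)
```

An empty `kept` list corresponds to a spectrum that supports neither charge state and would be dropped; a list containing both 2 and 3 corresponds to keeping both copies. Real +3 spectra also contain multiply charged fragments, which this sketch ignores.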
We present a new probability-based method for protein identification using tandem mass spectra and protein databases. The method employs a hypergeometric distribution to model frequencies of matches between fragment ions predicted for peptide sequences with a specific (M + H)+ value (at some mass tolerance) in a protein sequence database and an experimental tandem mass spectrum. The hypergeometric distribution constitutes the null hypothesis: all peptide matches to a tandem mass spectrum are random. It is used to generate a score characterizing the randomness of a database sequence match to an experimental tandem mass spectrum and to determine the level of significance of the null hypothesis. For each tandem mass spectrum and database search, a peptide is identified that has the least probability of being a random match to the spectrum and the corresponding level of significance of the null hypothesis is determined. To check the validity of the hypergeometric model in describing fragment ion matches, we used a chi-square (χ²) test. The distribution of frequencies and corresponding hypergeometric probabilities are generated for each tandem mass spectrum. No proteolytic cleavage specificity is used to create the peptide sequences from the database. We do not use any empirical probabilities in this method. The scores generated by the hypergeometric model do not have a significant molecular weight bias and are reasonably independent of database size. The approach has been implemented in a database search algorithm, PEP_PROBE. By using a large set of tandem mass spectra derived from a set of peptides created by digestion of a collection of known proteins using four different proteases, a false positive rate of 5% is demonstrated.
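The hypergeometric null model can be made concrete with a small sketch. It assumes the following setup, which is an interpretation of the abstract rather than PEP_PROBE's documented implementation: across all candidate peptides there are N predicted fragment ions, K of which match a peak in the spectrum; a given peptide contributes n predicted ions, k of which match. Under the null hypothesis the peptide's ions are a random draw from the pool, so the probability of k or more matches is a hypergeometric tail. The function names and numbers are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(exactly k matches) when n ions are drawn from N, of which K match."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def hypergeom_tail(k, N, K, n):
    """P(k or more matches) under the random-match null hypothesis."""
    upper = min(K, n)
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, upper + 1))


def best_candidate(candidates, N, K):
    """Pick the peptide least likely to be a random match.

    candidates maps a peptide label to (n predicted ions, k matched ions).
    """
    scored = {
        pep: hypergeom_tail(k, N, K, n) for pep, (n, k) in candidates.items()
    }
    best = min(scored, key=scored.get)
    return best, scored


# Invented counts: a pool of 2000 predicted ions, 200 of which match peaks.
candidates = {"PEPTIDEA": (20, 12), "PEPTIDEB": (20, 3), "PEPTIDEC": (18, 2)}
best, scored = best_candidate(candidates, N=2000, K=200)
```

With a 10% background match rate, a peptide matching 12 of 20 ions has a very small tail probability, while one matching 2 or 3 is close to what chance alone would produce, so the first is selected.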
We recently developed a method for estimating protein dynamics in vivo with 2H2O using MALDI-TOF MS (Rachdaoui N. et al., MCP, 8, 2653-2662, 2009) and we confirmed that 2H-labeling of many hepatic free amino acids rapidly equilibrated with body water. Although this is a reliable method, it required modest sample purification and necessitated the determination of tissue-specific amino acid labeling. Another approach for quantifying protein kinetics is to measure the 2H-enrichments of body water (precursor) and protein-bound amino acid or proteolytic peptide (product) and to estimate how many copies of deuterium are incorporated into a product. In this study we have used nanospray LTQ-FTICR mass spectrometry to simultaneously measure the isotopic enrichment of peptides and protein-bound amino acids. A mathematical algorithm was developed to aid the data processing. The most notable improvement centers on the fact that the precursor:product labeling ratio can be obtained by measuring the labeling of water and a protein(s) (or peptides) of interest, thereby minimizing the need to measure the amino acid labeling. As a proof of principle, we demonstrate that this approach can detect the effect of nutritional status on albumin synthesis in rats given 2H2O.