Abstract: Percolator is an important tool for greatly improving the results of a database search and subsequent downstream analysis. Using support vector machines (SVMs), Percolator recalibrates peptide-spectrum matches based on the learned decision boundary between targets and decoys. To improve analysis time for large-scale data sets, we update Percolator's SVM learning engine through software and algorithmic optimizations rather than heuristic approaches that necessitate the careful study of their impact on learned p…
“…We screened out significant risk genes through feature selection and optimization. An SVM model was trained using ten-fold crossvalidation [18]. The SVM model is a supervised classification algorithm of machine learning.…”
Section: Construction Of Classification Model By SVM (mentioning)
“…The second bottleneck is the execution time required to learn SVM parameters. Recent work [11] has tackled this bottleneck through software optimizations to Percolator's SVM learning engine, and our efforts complement and further improve upon these optimizations. On a massive data set containing over 215 million PSMs, the new version of Percolator achieves an overall speedup of 439% (81.4 h down to 18.6 h).…”
Section: Contributions (mentioning)
confidence: 99%
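The two running times quoted above (81.4 h and 18.6 h) can be turned into a speedup factor and a remaining-time fraction with a few lines of arithmetic. The sketch below is only a sanity check on the quoted figures; the function name `speedup_summary` is hypothetical, and the small gap between the computed ratio and the quoted 439% presumably reflects rounding of the reported hours.

```python
def speedup_summary(old_hours, new_hours):
    """Return (speedup factor, fraction of the original running time)."""
    factor = old_hours / new_hours      # how many times faster the new run is
    fraction = new_hours / old_hours    # share of the old running time still needed
    return factor, fraction


factor, fraction = speedup_summary(81.4, 18.6)
print(f"speedup factor: {factor:.2f}x")
print(f"remaining running time: {fraction:.1%}")
```

Expressed this way, a factor of roughly 4.4 and a reduction to roughly 23% of the original running time describe the same measurement.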
“…Finally, we optimized the CGLS solver itself using a mixture of low-level linear algebra function calls and software streamlining, as described previously [11]. Optimizations are compared against the recently described CGLS multithreaded speedup [11], referred to as CGLS-par. In contrast to the second in our series of optimizations, which uses multiple threads to parallelize runs of CGLS at the crossvalidation level, CGLS-par instead uses multiple threads to parallelize computation within the CGLS algorithm.…”
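The contrast drawn in this passage, threads spread across cross-validation folds versus threads inside a single CGLS run, can be illustrated with a minimal sketch of the first strategy. This is not Percolator's implementation: the names `cgls` and `fit_folds`, the round-robin fold assignment, and the use of plain least squares in place of the SVM objective are all assumptions made for illustration. In CPython, pure-Python threads do not execute in parallel, so the sketch shows the structure of fold-level parallelism rather than its speed.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec(A, x):
    """Compute A @ x for a list-of-rows matrix."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Compute A^T @ y for a list-of-rows matrix."""
    out = [0.0] * len(A[0])
    for row, yi in zip(A, y):
        for j, a in enumerate(row):
            out[j] += a * yi
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cgls(A, b, iters=50, tol=1e-12):
    """Conjugate gradient least squares: minimize ||A x - b||^2."""
    x = [0.0] * len(A[0])
    r = list(b)                 # residual b - A x, with x = 0
    s = rmatvec(A, r)           # gradient direction A^T r
    p = list(s)                 # search direction
    gamma = dot(s, s)
    for _ in range(iters):
        if gamma < tol:         # squared gradient norm is negligible
            break
        q = matvec(A, p)
        alpha = gamma / dot(q, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        s = rmatvec(A, r)
        gamma_new = dot(s, s)
        beta = gamma_new / gamma
        p = [si + beta * pi for si, pi in zip(s, p)]
        gamma = gamma_new
    return x


def fit_folds(A, b, n_folds=3, workers=3):
    """Run one independent CGLS solve per fold, each in its own thread.

    Fold k trains on every row whose index is not congruent to k
    modulo n_folds (a hypothetical round-robin split).
    """
    def train_without_fold(k):
        rows = [i for i in range(len(A)) if i % n_folds != k]
        return cgls([A[i] for i in rows], [b[i] for i in rows])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_without_fold, range(n_folds)))
```

Because each fold's solve touches only its own rows and its own solution vector, the folds need no synchronization. The alternative described in the passage (CGLS-par) would instead split the work inside `matvec` and `rmatvec` across threads within a single solve.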
The processing of peptide tandem mass spectrometry data involves matching observed spectra against a sequence database. The ranking and calibration of these peptide-spectrum matches can be improved substantially using a machine learning postprocessor. Here, we describe our efforts to speed up one widely used postprocessor, Percolator. The improved software is dramatically faster than the previous version of Percolator, even when using relatively few processors. We tested the new version of Percolator on a data set containing over 215 million spectra and recorded an overall reduction to 23% of the running time as compared to the unoptimized code. We also show that the memory footprint required by these speedups is modest relative to that of the original version of Percolator.
“…Recent advances in machine learning tools and widespread use of high throughput techniques provides a massive amount of data as a source to develop tools for every step in MS-based workflows (Bouwmeester et al, 2020). For example, the post-processing tool Percolator (Käll et al, 2007; Halloran and Rocke, 2018) integrates several features into a semi-supervised learning algorithm to improve the distinction between true and false peptide-spectrum matches. Next to that, spectrum intensity predictors, such as MS²PIP (Degroeve et al, 2015; Gabriels et al, 2019) and Prosit (Gessulat et al, 2019) are new models that incorporate fragment ion intensities predictions as additional features next to the standard m/z ratio during spectral library searching to increase the resolution of the identification, even in challenging workflows such as proteogenomics (Verbruggen et al, 2021).…”
Bioactive peptides exhibit key roles in a wide variety of complex processes, such as regulation of body weight, learning, aging, and innate immune response. Next to the classical bioactive peptides, emerging from larger precursor proteins by specific proteolytic processing, a new class of peptides originating from small open reading frames (sORFs) have been recognized as important biological regulators. But their intrinsic properties, specific expression pattern and location on presumed non-coding regions have hindered the full characterization of the repertoire of bioactive peptides, despite their predominant role in various pathways. Although the development of peptidomics has offered the opportunity to study these peptides in vivo, it remains challenging to identify the full peptidome as the lack of cleavage enzyme specification and large search space complicates conventional database search approaches. In this study, we introduce a proteogenomics methodology using a new type of mass spectrometry instrument and the implementation of machine learning tools toward improved identification of potential bioactive peptides in the mouse brain. The application of trapped ion mobility spectrometry (tims) coupled to a time-of-flight mass analyzer (TOF) offers improved sensitivity, an enhanced peptide coverage, reduction in chemical noise and the reduced occurrence of chimeric spectra. Subsequent machine learning tools MS2PIP, predicting fragment ion intensities and DeepLC, predicting retention times, improve the database searching based on a large and comprehensive custom database containing both sORFs and alternative ORFs. Finally, the identification of peptides is further enhanced by applying the post-processing semi-supervised learning tool Percolator. 
Applying this workflow, the first peptidomics workflow combined with spectral intensity and retention time predictions, we identified a total of 167 predicted sORF-encoded peptides, of which 48 originating from presumed non-coding locations, next to 401 peptides from known neuropeptide precursors, linked to 66 annotated bioactive neuropeptides from within 22 different families. Additional PEAKS analysis expanded the pool of SEPs on presumed non-coding locations to 84, while an additional 204 peptides completed the list of peptides from neuropeptide precursors. Altogether, this study provides insights into a new robust pipeline that fuses technological advancements from different fields ensuring an improved coverage of the neuropeptidome in the mouse brain.