2019
DOI: 10.1093/bioinformatics/btz147
SW-Tandem: a highly efficient tool for large-scale peptide identification with parallel spectrum dot product on Sunway TaihuLight

Abstract: Tandem mass spectrometry-based database searching is a widely acknowledged and adopted method for identifying peptide sequences in shotgun proteomics. However, database searching is extremely computationally expensive and can take days or even weeks to process a large spectra dataset. To address this critical issue, this paper presents SW-Tandem, a new tool for large-scale peptide sequencing. SW-Tandem parallelizes the spectrum dot product scoring algorithm and leverages the advantages…
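Since the abstract centers on the spectrum dot product (SDP) kernel, a minimal serial sketch may help. It assumes the standard formulation, in which both spectra are binned into intensity vectors over m/z and the score is their inner product; all names and the 1 Da bin width are illustrative assumptions, not SW-Tandem's actual code.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Accumulate a peak list (m/z, intensity) into fixed-width m/z bins.
std::vector<double> binSpectrum(const std::vector<std::pair<double, double>>& peaks,
                                double maxMz, double binWidth = 1.0) {
    std::vector<double> bins(static_cast<std::size_t>(maxMz / binWidth) + 1, 0.0);
    for (const auto& [mz, intensity] : peaks) {
        std::size_t b = static_cast<std::size_t>(mz / binWidth);
        if (b < bins.size()) bins[b] += intensity;
    }
    return bins;
}

// SDP score: inner product of the binned experimental and theoretical spectra.
// A tight loop like this is the kind of kernel the paper vectorizes and parallelizes.
double spectrumDotProduct(const std::vector<double>& experimental,
                          const std::vector<double>& theoretical) {
    double score = 0.0;
    const std::size_t n = std::min(experimental.size(), theoretical.size());
    for (std::size_t i = 0; i < n; ++i)
        score += experimental[i] * theoretical[i];
    return score;
}
```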

Cited by 13 publications (11 citation statements)
References 7 publications
“…acquisition time to a few minute range [12][13][14] and increasing the throughput of data processing [15][16][17][18][19][20].…”
Section: Introduction
confidence: 99%
“…Therefore, solving these computational problems requires data structures such as hash tables, graphs, and sparse unstructured matrices, whose computations exhibit little memory locality and naturally lead to unpredictable communication patterns both in space (arbitrary connections between computing components) and in time (the processes or threads may need data from one another at any point). Further, the existing high-performance computing methods [198], [199], [200] have been built around inherently serial designs in which the database is replicated on all parallel nodes and the experimental data are split among them. This strategy is not scalable, owing to the space complexity of indexing proteome databases with multiple PTMs (especially the fragment-ion index) [201], or when multiple proteome database searches are required for systems biology experiments.…”
Section: A. Proteogenomic Tools
confidence: 99%
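For context on the replicate-database/split-data design this passage critiques, a minimal sketch of the pattern follows: every rank loads the full peptide database and then searches only its own contiguous block of spectra. The struct and even block partition are illustrative assumptions, not code from any cited tool.

```cpp
#include <algorithm>
#include <cstddef>

struct Slice { std::size_t begin, end; };  // half-open range [begin, end)

// Divide numSpectra as evenly as possible across `size` ranks; the database
// itself is simply replicated, which is the scalability concern raised above.
Slice mySlice(std::size_t numSpectra, int rank, int size) {
    std::size_t base = numSpectra / static_cast<std::size_t>(size);
    std::size_t rem  = numSpectra % static_cast<std::size_t>(size);
    std::size_t r    = static_cast<std::size_t>(rank);
    // The first `rem` ranks take one extra spectrum each.
    std::size_t begin = r * base + std::min(r, rem);
    std::size_t len   = base + (r < rem ? 1 : 0);
    return {begin, begin + len};
}
```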
“…These studies include Parallel Tandem [198], which spawns multiple instances of the original X!Tandem on distributed machines; X!!Tandem [209], which achieves parallelism using owner-compute MPI processes; MR-Tandem [210], which uses Map-Reduce instead of MPI for better speedup efficiency; MCtandem [211], which employs the Intel Many Integrated Core (MIC) co-processor architecture to speed up spectral dot products (SDP) for X!Tandem; and SW-Tandem [199], which employs the Haswell AVX2 engine to speed up SDP computations on the Sunway TaihuLight supercomputer. SW-Tandem also spawns a manager process that distributes the experimental data to worker processes via a global queue for better load balancing.…”
Section: A. Limitation
confidence: 99%
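The global-queue load balancing this passage attributes to SW-Tandem can be sketched as an MPI manager/worker loop: rank 0 hands out spectrum-batch indices on demand, so faster workers naturally consume more batches. The tags, batch count, and the searchBatch hook are hypothetical, not the tool's actual interface.

```cpp
#include <mpi.h>
#include <queue>

enum { TAG_REQUEST = 1, TAG_WORK = 2, TAG_DONE = 3 };

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int numBatches = 1000;  // assumed number of spectrum batches

    if (rank == 0) {
        // Manager: keep a global queue of batch indices, serve them on demand.
        std::queue<int> work;
        for (int b = 0; b < numBatches; ++b) work.push(b);
        int active = size - 1;
        while (active > 0) {
            int req = 0;
            MPI_Status st;
            MPI_Recv(&req, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                     MPI_COMM_WORLD, &st);
            if (!work.empty()) {
                int batch = work.front();
                work.pop();
                MPI_Send(&batch, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
            } else {
                int stop = -1;  // queue drained: tell this worker to finish
                MPI_Send(&stop, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE,
                         MPI_COMM_WORLD);
                --active;
            }
        }
    } else {
        // Worker: request a batch, search it against the local database copy,
        // repeat until the manager signals completion.
        for (;;) {
            int req = 0, batch = 0;
            MPI_Send(&req, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Status st;
            MPI_Recv(&batch, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            // searchBatch(batch);  // hypothetical: score this batch's spectra
        }
    }
    MPI_Finalize();
    return 0;
}
```

Dynamic dispatch of this kind avoids the imbalance of a static split when batches take uneven time to score.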
“…As demonstrated by other big data fields [23], such limitations can be reduced by developing parallel algorithms that combine the computational power of thousands of processing elements across distributed-memory clusters and supercomputers. We and others have developed high-performance computing (HPC) techniques for processing MS data, including for multicore [3], [2], [10], [9] and distributed-memory architectures [24], [25], [26], [27], [28], [29]. As with the serial algorithms, the objective of these HPC methods has been to speed up the arithmetic scoring part of the search engines by spawning multiple (managed) instances of the original code, replicating the theoretical database, and splitting the experimental data.…”
Section: Main
confidence: 99%