Peptimapper: proteogenomics workflow for the expert annotation of eukaryotic genomes

Guillot, Laetitia; Delage, Ludovic; Viari, Alain; Vandenbrouck, Yves; Com, Emmanuelle; Ritter, Andrés; Lavigne, Régis; Marie, Dominique; Peterlongo, Pierre; Potin, Philippe; Pineau, Charles

doi:10.1186/s12864-019-5431-9

Cited by 12 publications

(8 citation statements)

References 68 publications

“…de novo and database search. There are only a few tools that perform de novo deduction such as Peptimapper [ 183 ], IggyPep [ 184 ], and Pepline [ 185 ]. There is a wide variety of tools using different runtime environments, inputs, peptide search engines, scoring methods, FDR analysis, and visualizations in database search.…”

Section: Proteogenomics (mentioning)

confidence: 99%

Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey

et al. 2021

Big Data Proteogenomics lies at the intersection of high-throughput Mass Spectrometry (MS) based proteomics and Next Generation Sequencing based genomics. The combined and integrated analysis of these two high-throughput technologies can help discover novel proteins using genomic, and transcriptomic data. Due to the biological significance of integrated analysis, the recent past has seen an influx of proteogenomic tools that perform various tasks, including mapping proteins to the genomic data, searching experimental MS spectra against a six-frame translation genome database, and automating the process of annotating genome sequences. To date, most of such tools have not focused on scalability issues that are inherent in proteogenomic data analysis where the size of the database is much larger than a typical protein database. These state-of-the-art tools can take more than half a month to process a small-scale dataset of one million spectra against a genome of 3 GB. In this article, we provide an up-to-date review of tools that can analyze proteogenomic datasets, providing a critical analysis of the techniques’ relative merits and potential pitfalls. We also point out potential bottlenecks and recommendations that can be incorporated in the future design of these workflows to ensure scalability with the increasing size of proteogenomic data. Lastly, we make a case of how high-performance computing (HPC) solutions may be the best bet to ensure the scalability of future big data proteogenomic data analysis.
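The six-frame translation mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard genetic code and an uppercase or lowercase DNA string; the function names (`reverse_complement`, `translate`, `six_frame_translation`) are hypothetical and are not taken from Peptimapper or any tool reviewed in the survey.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# organism-specific or mitochondrial table is needed).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate both strands in all three reading frames.

    Keys are '+1', '+2', '+3' for the forward strand and '-1', '-2', '-3'
    for the reverse complement. Stop codons are kept as '*'.
    """
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Because every genome position is translated six times, the resulting search space is far larger than an annotated protein database, which is consistent with the scalability concern the abstract raises.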

“…In bottom-up proteomics, the mass spectra of (often tryptic) peptides are matched against their in silico digested counterparts generated from a database. Under a broader proteogenomic framework, various computational strategies have been developed to integrate proteomic data with (canonical and non-canonical) genomic annotation pipelines or to generate standalone in silico translation databases for discovery of novel proteins ( Risk et al, 2013 ; Jagtap et al, 2014 ; Mackowiak et al, 2015 ; Nagaraj et al, 2015 ; Zickmann and Renard, 2015 ; Kolmogorov et al, 2016 ; Olexiouk et al, 2016 ; Brunet et al, 2018 ; Guillot et al, 2019 ). At the MS-based experimental front, various fractionation and small protein enrichment methods have been employed to successfully identify novel non-canonical proteins in eukaryotic cell lines and tissues ( Ma et al, 2016a ; Li et al, 2017 ; He et al, 2018 ; Cao et al, 2020 ; Cardon et al, 2020 ; Kaulich et al, 2020 ; Cassidy et al, 2021 ; Wang et al, 2021 ).…”

Section: Introduction (mentioning)

confidence: 99%

Identification of Non-Canonical Translation Products in C. elegans Using Tandem Mass Spectrometry

et al. 2021

Transcriptome and ribosome sequencing have revealed the existence of many non-canonical transcripts, mainly containing splice variants, ncRNA, sORFs and altORFs. However, identification and characterization of products that may be translated out of these remains a challenge. Addressing this, we here report on 552 non-canonical proteins and splice variants in the model organism C. elegans using tandem mass spectrometry. Aided by sequencing-based prediction, we generated a custom proteome database tailored to search for non-canonical translation products of C. elegans. Using this database, we mined available mass spectrometric resources of C. elegans, from which 51 novel, non-canonical proteins could be identified. Furthermore, we utilized diverse proteomic and peptidomic strategies to detect 40 novel non-canonical proteins in C. elegans by LC-TIMS-MS/MS, of which 6 were common with our meta-analysis of existing resources. Together, this permits us to provide a resource with detailed annotation of 467 splice variants and 85 novel proteins mapped onto UTRs, non-coding regions and alternative open reading frames of the C. elegans genome.
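The in silico digestion referred to in the citation statement above can be sketched as follows. This is a minimal illustration that assumes the common trypsin rule (cleave after K or R unless the next residue is P) with no missed cleavages; the function name `tryptic_digest` and its `min_length` parameter are hypothetical and do not come from the cited study.

```python
import re

# Zero-width split point: after K or R, but not when followed by P.
TRYPSIN_SITE = re.compile(r"(?<=[KR])(?!P)")


def tryptic_digest(protein, min_length=1):
    """Return fully cleaved tryptic peptides of a protein sequence.

    Peptides shorter than min_length are dropped; search engines usually
    apply such a filter, but the threshold here is an arbitrary default.
    """
    peptides = TRYPSIN_SITE.split(protein.upper())
    return [p for p in peptides if len(p) >= min_length]


if __name__ == "__main__":
    print(tryptic_digest("MKRPAGKLLR"))
```

In a database search, the theoretical masses and fragment spectra of such peptides are what the observed tandem mass spectra are compared against.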

“…Previously, scientists looked for protein evidence of a small number of variants in particular and resorted to targeted proteomics approaches such as selected reaction monitoring (SRM) [9][10][11][12] . Alternatively, BLAST-like query tools such as peptimapper and PepQuery 13,14 or database tools like XMAn v2 15 and dbSAP 16 can be used to investigate single events 17,18 .…”

Section: Introduction (mentioning)

confidence: 99%

The Personalized Proteome: Comparing Proteogenomics and Open Variant Search Approaches for Single Amino Acid Variant Detection

Salz, Bouwmeester, Gabriels et al. 2020

Preprint

Discovery of variant peptides such as single amino acid variant (SAAV) in shotgun proteomics data is essential for personalized proteomics. Both the resolution of shotgun proteomics methods and the search engines have improved dramatically, allowing for confident identification of SAAV peptides. However, it is not yet known if these methods are truly successful in accurately identifying SAAV peptides without prior genomic information in the search database. We studied this in unprecedented detail by exploiting publicly available long-read RNA seq and shotgun proteomics data from the gold standard reference cell line NA12878. Searching spectra from this cell line with the state-of-the-art open modification search engine ionbot against carefully curated search databases resulted in 96.7% false positive SAAVs and an 85% lower true positive rate than searching with peptide search databases that incorporate prior genetic information. While adding genetic variants to the search database remains indispensable for correct peptide identification, inclusion of long-read RNA sequences in the search database contributes only 0.3% new peptide identifications. These findings reveal the differences in SAAV detection that result from various approaches, providing guidance to researchers studying SAAV peptides and developers of peptide spectrum identification tools.
