Tools to Covisualize and Coanalyze Proteomic Data with Genomes and Transcriptomes: Validation of Genes and Alternative mRNA Splicing

Pang, Chi Nam Ignatius; Tay, Aidan P.; Aya, Carlos; Twine, Natalie A.; Harkness, Linda; Hart‐Smith, Gene; Chia, Samantha Z.; Chen, Zhiliang; Deshpande, Nandan P.; Kaakoush, Nadeem O.; Mitchell, Hazel M.; Kassem, Moustapha; Wilkins, Marc R.

doi:10.1021/pr400820p

Cited by 40 publications

(60 citation statements)

References 72 publications


“…However, in both cases, DBs specific for the conditions studied are generated, requiring bioinformatics expertise and limiting the general applicability of the resource. The GFF file we provide can be very valuable for other proteogenomics software solutions like GenoSuite (Kumar et al 2013), PGP (Tovchigrechko et al 2014), and PG Nexus (Pang et al 2014), which allow users to search their data against a six-frame translation and later visualize identified peptides onto a genome sequence but lack integrated and consolidated annotations.…”

Section: Discussion (mentioning)

confidence: 99%

“…However, they do not integrate different annotations of the same genome. PG Nexus (Pang et al 2014) uses the NCBI RefSeq annotation, a Glimmer ab initio prediction (Delcher et al 2007), and a six-frame translation against which peptides are searched with Mascot (Perkins et al 1999) and later visualized onto the genome. However, the annotations are not integrated and consolidated; the boundaries of novel ORFs still have to be discovered based on peptide evidence, which requires substantial manual effort.…”

Section: A General Integrative Proteogenomics Approach (mentioning)

confidence: 99%
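The six-frame translation mentioned in the statements above can be illustrated with a minimal Python sketch. It is not the PG Nexus or Mascot implementation: the function names are hypothetical, the standard genetic code is assumed, and codons containing ambiguous bases are simply written as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative translation table is used, although some prokaryotes need one).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGAAATAG")
print(frames["+1"])  # MK*
print(frames["-1"])  # LFH
```

In a real search database, each frame would additionally be split at stop codons (`*`) into candidate ORFs before peptides are matched against it.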

“…Defining DB complexity as the number of distinct tryptic peptides of 6-40aa in length, the complexity of the resulting iPtgxDB was ∼50% of that of a full six-frame translated genome that Mascot (Perkins et al 1999) or PG Nexus (Pang et al 2014) would rely on to identify proteogenomic evidence for novel peptides. Despite the relatively large number of entries, the DB complexity is only 70% of that of baker's yeast and below 20% of a human protein DB (Supplemental Table S2).…”

Section: www.genome.org (mentioning)

confidence: 99%
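The complexity measure quoted above (the number of distinct tryptic peptides of 6-40 aa) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, trypsin is assumed to cleave after K or R except before P, and missed cleavages are ignored.

```python
import re


def tryptic_peptides(protein, min_len=6, max_len=40):
    """Return the set of fully tryptic peptides within the length window.

    Assumed cleavage rule: after K or R, but not when followed by P.
    Missed cleavages are not modelled.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return {p for p in pieces if min_len <= len(p) <= max_len}


def db_complexity(proteins):
    """Count distinct tryptic peptides across all database entries."""
    distinct = set()
    for protein in proteins:
        distinct |= tryptic_peptides(protein)
    return len(distinct)


# Two toy entries that share the peptide MAAAAAK.
toy_db = ["MAAAAAKGGGGGGRPLLLLK", "MAAAAAKTTK"]
print(db_complexity(toy_db))  # 2
```

Because peptides are collected in a set, a sequence shared by several entries is counted once, which is why a database with many overlapping entries can still have a lower complexity than a full six-frame translation.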

“…RNA-seq data have been used to limit the protein search DB size to achieve better statistical power (Wang et al 2012; Woo et al 2013; Zickmann and Renard 2015). Other MS-friendly DB solutions that integrate data from different species or strains include MScDB (Marx et al 2013), MSMSpdbb (de Souza et al 2010), and PG Nexus (Pang et al 2014). Even pipeline solutions were developed that allow the search of proteomics data against a six-frame translation-based DB, including Peppy (Risk et al 2013), GenoSuite (Kumar et al 2013), and PGP (Tovchigrechko et al 2014).…”

mentioning

confidence: 99%


An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics

et al. 2017


Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations. Our strategy toward accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources, ab initio gene prediction algorithms and in silico ORFs (a modified six-frame translation considering alternative start codons) in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensive Bartonella henselae proteomics data set against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and single amino acid variations (SAAVs) identified in a re-sequenced laboratory strain that are not present in its reference genome. We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin. We release iPtgxDBs for B. henselae, Bradyrhizobium diazoefficiens and Escherichia coli and the software to generate both proteogenomics search databases and integrated annotation files that can be viewed in a genome browser for any prokaryote.


“…It follows naturally that the ideal unified coordinate system for proteogenomics should remain genomic in nature. Indeed, effective tools that can map MS-based proteomics results onto genomic coordinates have recently become available (Peppy [2], Proteogenomic Mapping Tool [3], Pepline [4], MS-Dictionary [5], GappedDictionary [6], IggyPep [7], MSProGene [8], ProteoAnnotator [9], PGNexus [10], and GalaxyP [11]); however, these tools are usually couched in a relatively involved and comprehensive pipeline (e.g., the GalaxyP pipeline consists of up to 140 steps) and typically impose a specific mass-informatic [12] workflow on the practitioner, by, for example, requiring the generation of short peptide sequence tags (PSTs) or some complex form of de novo peptide sequencing followed by a lookup against the full six-frame translation of the genomic sequence. Our experience suggests that a more common scenario involves the production, by the genomic arm of the workflow, of a (liberally) predicted proteome (containing what is assumed to be a superset of the observable proteome) so as to leverage existing PSM search engines (such as Mascot [13], Sequest [14], X!Tandem [15]) that require a straightforward representation of the predicted proteome (in the form of a FASTA file).…”

Section: Introduction (mentioning)

confidence: 99%

PGx: Putting Peptides to BED

2015


Every molecular player in the cast of biology’s central dogma is being sequenced and quantified with increasing ease and coverage. To bring the resulting genomic, transcriptomic, and proteomic data sets into coherence, tools must be developed that do not constrain data acquisition and analytics in any way but rather provide simple links across previously acquired data sets with minimal preprocessing and hassle. Here we present such a tool: PGx, which supports proteogenomic integration of mass spectrometry proteomics data with next-generation sequencing by mapping identified peptides onto their putative genomic coordinates.
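The abstract above does not describe how PGx performs the mapping, so the following Python sketch only illustrates the general idea of placing a peptide on genomic coordinates and writing it as a BED12 record. All names are hypothetical, exon intervals are assumed to be 0-based, half-open CDS segments sorted by ascending genomic position, and the protein is assumed to start at the first CDS base.

```python
def peptide_to_genomic_blocks(exons, strand, aa_start, aa_len):
    """Map a peptide (0-based residue offset and length) to genomic blocks.

    exons: list of (start, end) CDS intervals, 0-based half-open, ascending.
    Returns genomic (start, end) blocks sorted by ascending position.
    """
    nt_start = aa_start * 3
    nt_end = nt_start + aa_len * 3
    # Walk exons in transcript order: reversed for the minus strand.
    ordered = exons if strand == "+" else exons[::-1]
    blocks = []
    consumed = 0  # CDS nucleotides covered by earlier exons
    for ex_start, ex_end in ordered:
        ex_len = ex_end - ex_start
        lo = max(nt_start, consumed)
        hi = min(nt_end, consumed + ex_len)
        if lo < hi:  # peptide overlaps this exon
            if strand == "+":
                blocks.append((ex_start + lo - consumed, ex_start + hi - consumed))
            else:
                blocks.append((ex_end - (hi - consumed), ex_end - (lo - consumed)))
        consumed += ex_len
    return sorted(blocks)


def bed12_line(chrom, name, strand, blocks):
    """Format sorted genomic blocks as a tab-separated BED12 record."""
    start, end = blocks[0][0], blocks[-1][1]
    sizes = ",".join(str(e - s) for s, e in blocks)
    starts = ",".join(str(s - start) for s, _ in blocks)
    fields = [chrom, start, end, name, 0, strand, start, end,
              "0,0,0", len(blocks), sizes, starts]
    return "\t".join(str(f) for f in fields)


# A 3-residue peptide starting at residue 2 of a two-exon, plus-strand CDS.
blocks = peptide_to_genomic_blocks([(100, 109), (200, 212)], "+", 2, 3)
print(blocks)  # [(106, 109), (200, 206)]
print(bed12_line("chr1", "pep1", "+", blocks))
```

A peptide that spans an exon junction yields two blocks, which a genome browser draws as a spliced feature; that is the main reason to emit BED12 rather than a three-column BED record.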


A tool for integrating genetic and mass spectrometry‐based peptide data: Proteogenomics Viewer

et al. 2017


In this manuscript we describe Proteogenomics Viewer, a web-based tool that collects MS peptide identification, indexes to genomic sequence and structure, assigns exon usage, reports the identified protein isoforms with genomic alignments and, most importantly, allows the inspection of MS2 information for proper peptide identification. It also provides all performed indexing to facilitate global analysis of the data. The relevance of such tool is that there has been an increase in the number of proteogenomic efforts to improve the annotation of both genomics and proteomics data, culminating with the release of the two human proteome drafts. It is now clear that mass spectrometry-based peptide identification of uncharacterized sequences, such as those resulting from unpredicted exon joints or non-coding regions, is still prone to a higher than expected false discovery rate. Therefore, proper visualization of the raw data and the corresponding genome alignments are fundamental for further data validation and interpretation. Also see the video abstract here: http://youtu.be/5NzyRvuk4Ac.

