A proteomics sample metadata representation for multiomics integration and big data analysis

Dai, Chengxin; Füllgrabe, Anja; Pfeuffer, Julianus; Solovyeva, Elizaveta M.; Deng, Jingwen; Moreno, Pablo; Kamatchinathan, Selvakumar; Kundu, Deepti Jaiswal; George, Nancy; Fexová, Silvie; Grüning, Björn; Föll, Melanie Christine; Griss, Johannes; Vaudel, Marc; Audain, Enrique; Locard‐Paulet, Marie; Turewicz, Michael; Eisenacher, Martin; Uszkoreit, Julian; Bossche, Tim Van Den; Schwämmle, Veit; Webel, Henry; Schulze, Stefan; Bouyssié, David; Jayaram, Savita; Duggineni, Vinay Kumar; Samaras, Patroklos; Wilhelm, Mathias; Choi, Meena; Wang, Mingxun; Kohlbacher, Oliver; Brāzma, Alvis; Papatheodorou, Irene; Bandeira, Nuno; Deutsch, Eric W.; Vizcaíno, Juan Antonio; Bai, Mingze; Sachsenberg, Timo; Levitsky, Lev I.; Perez‐Riverol, Yasset

doi:10.1038/s41467-021-26111-3

Cited by 60 publications

(54 citation statements)

References 37 publications

“…As also reported in previous studies, one of the major bottlenecks was the curation of dataset metadata, consisting of mapping files to samples and biological conditions. Very recently, the MAGE-TAB-Proteomics format has been developed and formalised to enable the reporting of the experimental design in proteomics experiments, including the relationship between samples and raw files, which is recorded in the SDRF-Proteomics section of the file [ 42 ]. Submission of the SDRF-Proteomics files to PRIDE is now supported.…”

Section: Discussion (mentioning)

confidence: 99%

Integrated view and comparative analysis of baseline protein expression in mouse and rat tissues

et al. 2022

Self Cite

The increasingly large amount of proteomics data in the public domain enables, among other applications, the combined analyses of datasets to create comparative protein expression maps covering different organisms and different biological conditions. Here we have reanalysed public proteomics datasets from mouse and rat tissues (14 and 9 datasets, respectively), to assess baseline protein abundance. Overall, the aggregated dataset contained 23 individual datasets, including a total of 211 samples coming from 34 different tissues across 14 organs, comprising 9 mouse and 3 rat strains. In all cases, we studied the distribution of canonical proteins between the different organs. The number of canonical proteins per dataset ranged from 273 (tendon) to 9,715 (liver) in mouse, and from 101 (tendon) to 6,130 (kidney) in rat. Then, we studied how protein abundances compared across different datasets and organs for both species. As a key point, we carried out a comparative analysis of protein expression between mouse, rat and human tissues. We observed a high level of correlation of protein expression among orthologs between all three species in brain, kidney, heart and liver samples, whereas the correlation of protein expression was generally slightly lower between organs within the same species. Protein expression results have been integrated into the resource Expression Atlas for widespread dissemination.

“…), prevents a more streamlined reuse of the available data, especially in the case of reanalyses of quantitative proteomics datasets. The MAGE-TAB for proteomics ( 34 ), an extension of the original MAGE-TAB format used in transcriptomics ( 35 ), has been recently proposed to capture the sample metadata and the experimental design for proteomics experiments (Figure 2 ).…”

Section: Current Status of the PRIDE Ecosystem: Resources and Tools (mentioning)

confidence: 99%

“…The SDRF-Proteomics is a tab-delimited format where each column is a property of the sample or the data file. Each row corresponds to the relation between a sample and a data file, and each cell is the value of the property for the sample or the data file ( 34 ) ( https://github.com/bigbio/proteomics-metadata-standard ).…”

Section: Current Status of the PRIDE Ecosystem: Resources and Tools (mentioning)

confidence: 99%
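
The row and column layout described in the statement above can be illustrated with a minimal Python sketch. It assumes the column names `source name`, `characteristics[...]` and `comment[data file]` from the SDRF-Proteomics specification; the sample names, file names and the helper functions `files_by_sample` and `sample_properties` are hypothetical and exist only for this illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical SDRF-Proteomics content: all values below are invented
# for illustration and do not come from any real dataset.
SDRF_TEXT = (
    "source name\tcharacteristics[organism]\tcharacteristics[organism part]"
    "\tassay name\tcomment[data file]\n"
    "sample 1\tMus musculus\tliver\trun 1\tliver_rep1.raw\n"
    "sample 1\tMus musculus\tliver\trun 2\tliver_rep2.raw\n"
    "sample 2\tMus musculus\tkidney\trun 3\tkidney_rep1.raw\n"
)


def files_by_sample(sdrf_text):
    """Group data files by sample; each row links one sample to one data file."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    mapping = defaultdict(list)
    for row in reader:
        mapping[row["source name"]].append(row["comment[data file]"])
    return dict(mapping)


def sample_properties(sdrf_text):
    """Collect the characteristics[...] columns (sample properties) per sample."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    properties = {}
    for row in reader:
        properties[row["source name"]] = {
            column: value
            for column, value in row.items()
            if column.startswith("characteristics[")
        }
    return properties


if __name__ == "__main__":
    print(files_by_sample(SDRF_TEXT))
    print(sample_properties(SDRF_TEXT))
```

Because every row is one sample-to-file relation, a sample measured in several runs simply appears on several rows, which is why the grouping step collects a list of files per sample.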

The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences

Perez‐Riverol, Bai, Bandla et al. 2021

Nucleic Acids Research

Self Cite

The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world's largest data repository of mass spectrometry-based proteomics data. PRIDE is one of the founding members of the global ProteomeXchange (PX) consortium and an ELIXIR core data resource. In this manuscript, we summarize the developments in PRIDE resources and related tools since the previous update manuscript was published in Nucleic Acids Research in 2019. The number of submitted datasets to PRIDE Archive (the archival component of PRIDE) has reached on average around 500 datasets per month during 2021. In addition to continuous improvements in PRIDE Archive data pipelines and infrastructure, the PRIDE Spectra Archive has been developed to provide direct access to the submitted mass spectra using Universal Spectrum Identifiers. As a key point, the file format MAGE-TAB for proteomics has been developed to enable the improvement of sample metadata annotation. Additionally, the resource PRIDE Peptidome provides access to aggregated peptide/protein evidences across PRIDE Archive. Furthermore, we will describe how PRIDE has increased its efforts to reuse and disseminate high-quality proteomics data into other added-value resources such as UniProt, Ensembl and Expression Atlas.

“…Mapping raw file names in PRIDE to the samples in the original publication was done manually and it constituted one of the most time-consuming steps in this work. In the context of the activities of the Proteomics Standards Initiative, a standard file format called SDRF-Proteomics (Sample and Data Relationship Format-Proteomics), part of the MAGE-TAB-Proteomics file format, has been formalised recently [34] for capturing the experimental design in proteomics experiments [3], and we have started working on the related tooling to facilitate the creation of these files. It is important to highlight that submission of SDRF-Proteomics files is already supported by PRIDE, although it is optional at the time of writing.…”

Section: Discussion (mentioning)

confidence: 99%

Implementing the reuse of public DIA proteomics datasets: from the PRIDE database to Expression Atlas

et al. 2022

Self Cite

The number of mass spectrometry (MS)-based proteomics datasets in the public domain keeps increasing, particularly those generated by Data Independent Acquisition (DIA) approaches such as SWATH-MS. Unlike Data Dependent Acquisition datasets, the re-use of DIA datasets has been rather limited to date, despite its high potential, due to the technical challenges involved. We introduce a (re-)analysis pipeline for public SWATH-MS datasets which includes a combination of metadata annotation protocols, automated workflows for MS data analysis, statistical analysis, and the integration of the results into the Expression Atlas resource. Automation is orchestrated with Nextflow, using containerised open analysis software tools, rendering the pipeline readily available and reproducible. To demonstrate its utility, we reanalysed 10 public DIA datasets from the PRIDE database, comprising 1,278 SWATH-MS runs. The robustness of the analysis was evaluated, and the results compared to those obtained in the original publications. The final expression values were integrated into Expression Atlas, making SWATH-MS experiments more widely available and combining them with expression data originating from other proteomics and transcriptomics datasets.
