The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world's largest data repository of mass spectrometry-based proteomics data. PRIDE is one of the founding members of the global ProteomeXchange (PX) consortium and an ELIXIR core data resource. In this manuscript, we summarize the developments in PRIDE resources and related tools since the previous update manuscript was published in Nucleic Acids Research in 2019. Submissions to PRIDE Archive (the archival component of PRIDE) averaged around 500 datasets per month during 2021. In addition to continuous improvements in PRIDE Archive data pipelines and infrastructure, the PRIDE Spectra Archive has been developed to provide direct access to the submitted mass spectra using Universal Spectrum Identifiers. As a key point, the file format MAGE-TAB for proteomics has been developed to improve sample metadata annotation. Additionally, the resource PRIDE Peptidome provides access to aggregated peptide/protein evidence across PRIDE Archive. Furthermore, we describe how PRIDE has increased its efforts to reuse and disseminate high-quality proteomics data into other added-value resources such as UniProt, Ensembl and Expression Atlas.
The ProteomeXchange (PX) consortium of proteomics resources (http://www.proteomexchange.org) has standardized data submission and dissemination of mass spectrometry proteomics data worldwide since 2012. In this paper, we describe the main developments since the previous update manuscript was published in Nucleic Acids Research in 2017. Since then, in addition to the four existing PX members at the time (PRIDE, PeptideAtlas including the PASSEL resource, MassIVE and jPOST), two new resources have joined PX: iProX (China) and Panorama Public (USA). We first describe the updated submission guidelines, now expanded to include six members. Next, with current data submission statistics, we demonstrate that the proteomics field is now actively embracing public open data policies. At the end of June 2019, more than 14 100 datasets had been submitted to PX resources since 2012, and from those, more than 9 500 in just the last three years. In parallel, an unprecedented increase of data re-use activities in the field, including ‘big data’ approaches, is enabling novel research and new data resources. Finally, we outline some of our plans for the coming years.
The EMBL-EBI Expression Atlas is an added-value knowledge base that enables researchers to answer the question of where (tissue, organism part, developmental stage, cell type) and under which conditions (disease, treatment, gender, etc.) a gene or protein of interest is expressed. Expression Atlas brings together data from >4500 expression studies from >65 different species, across different conditions and tissues. It makes these data freely available in an easy-to-visualise form, after expert curation to accurately represent the intended experimental design, re-analysed via standardised pipelines that rely on open-source, community-developed tools. Each study's metadata are annotated using ontologies. The data are re-analysed with the aim of reproducing the original conclusions of the underlying experiments. Expression Atlas is currently divided into Bulk Expression Atlas and Single Cell Expression Atlas. Expression Atlas contains data from differential studies (microarray and bulk RNA-Seq) and baseline studies (bulk RNA-Seq and proteomics), whereas Single Cell Expression Atlas is currently dedicated to Single Cell RNA-Sequencing (scRNA-Seq) studies. The resource has been in continuous development since 2009 and is available at https://www.ebi.ac.uk/gxa.
The availability of proteomics datasets in the public domain, and in the PRIDE database in particular, has increased dramatically in recent years. This unprecedented large-scale availability of data provides an opportunity for combined analyses of datasets to obtain organism-wide protein expression data in a consistent manner. We have reanalysed 25 public proteomics datasets from healthy human individuals to assess baseline protein abundance in 32 organs. We defined tissue as a distinct functional or structural region within an organ. Overall, the aggregated dataset contains 68 healthy tissues, corresponding to 3,167 mass spectrometry runs covering 501 samples, coming from 492 individuals. We compared protein expression between the different organs, studied the distribution of proteins across organs, and identified proteins, as well as their isoforms, that are uniquely expressed in certain organs. We also performed gene ontology and pathway enrichment analyses to identify organ-specific enriched biological processes and pathways. As a key point, we have integrated the protein expression results into the resource Expression Atlas, where they can be accessed and visualised either individually or together with gene expression data coming from transcriptomics datasets.
Background: The 5-year survival rate of patients with pancreatic ductal adenocarcinoma (PDAC) is around 5% due to the fact that the majority of patients present with advanced disease that is treatment resistant. Familial pancreatic cancer (FPC) is a rare disorder that is defined as a family with at least two affected first degree relatives, with an estimated incidence of 4%–10%. The genetic basis is unknown in the majority of families although around 10%–13% of families carry germline mutations in known genes associated with hereditary cancer and pancreatitis syndromes. Methods: Panel sequencing was performed of 35 genes associated with hereditary cancer in 43 PDAC cases from families with an apparent hereditary pancreatic cancer syndrome. Findings: Pathogenic variants were identified in 19% (5/26) of PDAC cases from pure FPC families in the genes MLH1, CDKN2A, POLQ and FANCM. Low frequency potentially pathogenic VUS were also identified in 35% (9/26) of PDAC cases from FPC families in the genes FANCC, MLH1, PMS2, CFTR, APC and MUTYH. Furthermore, an important proportion of PDAC cases harboured more than one pathogenic, likely pathogenic or potentially pathogenic VUS, highlighting the multigene phenotype of FPC. Interpretation: The genetic basis of familial or hereditary pancreatic cancer can be explained in 21% of families by previously described hereditary cancer genes. Low frequency variants in other DNA repair genes are also present in 35% of families which may contribute to the risk of pancreatic cancer development.
Funding: This study was funded by the Instituto de Salud Carlos III (Plan Estatal de I+D+i 2013–2016): ISCIII (PI09/02221, PI12/01635, PI15/02101 and PI18/1034) and co-financed by the European Development Regional Fund "A way to achieve Europe" (ERDF), the Biomedical Research Network in Cancer: CIBERONC (CB16/12/00446), Red Temática de Investigación Cooperativa en Cáncer: RTICC (RD12/0036/0073) and La Asociación Española Contra el Cáncer: AECC (Grupos Coordinados Estables 2016).