Genome annotation is central to today's proteomic research because it draws the outlines of the proteomic landscape. Traditional models of open reading frame (ORF) annotation impose two arbitrary criteria: a minimum length of 100 codons and a single ORF per transcript. However, a growing number of studies report expression of proteins from allegedly non-coding regions, challenging the accuracy of current genome annotations. These novel proteins are encoded within non-coding RNAs, within the 5' or 3' untranslated regions (UTRs) of mRNAs, or in an alternative ORF overlapping a known coding sequence (CDS). OpenProt is the first database that enforces a polycistronic model for eukaryotic genomes, allowing annotation of multiple ORFs per transcript. OpenProt is freely accessible and offers custom downloads of protein sequences across 10 species. Using the OpenProt database for proteomic experiments enables the discovery of novel proteins and highlights the polycistronic nature of eukaryotic genes. The size of the full OpenProt database (all predicted proteins) is substantial and needs to be taken into account in the analysis. However, with appropriate false discovery rate (FDR) settings or the use of a restricted OpenProt database, users will gain a more realistic view of the proteomic landscape. Overall, OpenProt is a freely available tool that will foster proteomic discoveries.
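To make the contrast between the two annotation models concrete, the following is a minimal Python sketch, not OpenProt's actual prediction pipeline. It assumes ATG-initiated ORFs read on the forward strand of a transcript sequence and ending at the first in-frame stop codon; the function names `find_orfs` and `longest_orf_only` and the 30-codon threshold in the usage note are illustrative assumptions, not taken from the text.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(transcript, min_codons):
    """Polycistronic view: return every ORF of at least `min_codons` codons.

    Each ORF is reported as (start, end, n_codons), where `start` is the
    index of the first ATG in an open stretch, `end` is the index just past
    the stop codon, and `n_codons` counts codons excluding the stop.
    """
    seq = transcript.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first in-frame ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, n_codons))
                start = None
    return sorted(orfs)


def longest_orf_only(transcript, min_codons=100):
    """Traditional view: keep a single ORF per transcript (the longest one)."""
    orfs = find_orfs(transcript, min_codons)
    return max(orfs, key=lambda orf: orf[2], default=None)
```

Calling `find_orfs(sequence, 30)` on a transcript would list every candidate ORF above a hypothetical 30-codon threshold, whereas `longest_orf_only(sequence)` mimics the traditional criteria and returns at most one ORF of 100 codons or more.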
Video Link
The video component of this article can be found at https://www.jove.com/video/59589/

A proportion of these predicted ORFs would be random and non-functional, which is why OpenProt accumulates experimental and functional evidence to increase confidence. Experimental evidence includes protein expression (by MS) and translation evidence (by ribosome profiling) 15. Functional evidence includes protein orthology (with an InParanoid-like approach) and functional domain prediction 15.

OpenProt offers several databases for download, ranging from one containing only well-supported proteins to custom-made databases. Here, we present a pipeline for the use of OpenProt databases and offer insights into which database to choose given the experimental aim. The proteomics analysis pipeline presented here is built on the Galaxy framework because it is open-access and easy to use, but the databases can be used with any workflow 16,17,18. We also show how to use the OpenProt website to gather further information on novel proteins detected by MS. Using OpenProt databases will provide a more exhaustive view of the proteomic landscape and will foster proteomics and biomarker discoveries in a more systematic way than current methods. This protocol highlights the use of OpenProt databases 15 when interrogating MS datasets; it will not review the design of the experiment itself,