Microproteins are peptides of 100 amino acids (AA) or fewer that are encoded by small open reading frames (smORFs). Microproteins have been shown to participate in and regulate a wide range of cellular functions. However, their annotation and identification are challenging, in part owing to their low molecular weight, low abundance, and hydrophobicity. These factors have left many smORFs unannotated during genome processing and have made microproteins difficult to identify at the protein level. Large-scale enrichment of microproteins within a proteogenomics workflow makes it possible to identify microproteins efficiently and to discover unannotated smORFs in Saccharomyces cerevisiae. We integrated four microprotein-specific enrichment strategies to enhance coverage. We identified 117 microproteins, verified 31 missing proteins (MPs), and discovered 3 novel smORFs. The 31 MPs were confirmed by spectrum quality checking. The three novel smORFs (YKL104W-A, YHR052C-B, and YHR054C-B) were retained after spectrum quality checking, peptide synthesis, homologue matching, and related checks. This study not only demonstrates that smORF candidates remain to be annotated even in an extensively studied organism but also presents an efficient strategy for the discovery of small MPs. All MS data sets have been deposited to the ProteomeXchange with identifier PXD008586.
In 2012, the Chromosome-centric Human Proteome Project (C-HPP) launched a search for missing proteins (MPs) to complete the Human Proteome Project (HPP). The majority of the MPs fall in the low-molecular-weight (LMW) range, especially from 0 to 40 kDa. LMW protein identification is challenging owing to the short length, low abundance, and hydrophobicity of these proteins. Furthermore, many sequences produced by trypsin digestion are unlikely to yield detectable peptides or MS2 spectra of reasonable quality. Therefore, we focused on small MPs by combining LMW protein enrichment with a complementary protease pair, trypsin and LysargiNase, applied to human testis samples. In-depth testis LMW protein profiling resulted in the identification of 4063 proteins, of which 2565 were LMW proteins and 1130 had pairs of peptides generated from both trypsin and LysargiNase. These peptide pairs provided additional mass spectral evidence for further verification of small MPs. Finally, two MPs were verified from the seven MP candidates. One of them, Q8N688, was verified with two series of continuous and complementary b/y product ions from the pairs of spectra for tryptic and LysargiNase-digested peptides after "mirror spectrum" matching. This enabled confident identification of the representative peptides for the target MP. For the other, Q86WR6, the two verified peptides were identified with the same strategy, one from the gel-separation sample and one from the gel-elution sample. Although the other five MP candidates showed high-quality spectra, they could not be confidently classified as PE1 and require further verification. All MS data sets have been deposited in the ProteomeXchange with identifier PXD010093.
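The complementary-protease idea can be illustrated with a minimal in-silico digestion sketch. This is not the authors' pipeline. It assumes the simplest cleavage rules (trypsin cuts C-terminal to K/R, LysargiNase cuts N-terminal to K/R, with no missed cleavages and no proline rule), and the protein sequence and the function names `digest` and `mirror_pairs` are hypothetical.

```python
def digest(sequence, protease):
    """Split a protein at every K/R (simplified: no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR":
            # Trypsin cuts after the K/R; LysargiNase cuts before it.
            cut = i + 1 if protease == "trypsin" else i
            if cut > start:
                peptides.append(sequence[start:cut])
                start = cut
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def mirror_pairs(sequence):
    """Pair tryptic and LysargiNase peptides that share the same core."""
    tryptic = digest(sequence, "trypsin")
    lysargi = digest(sequence, "lysarginase")
    pairs = []
    for t in tryptic:
        for l in lysargi:
            # Tryptic peptide ends in K/R, its mirror starts with K/R,
            # and the residues in between are identical.
            if t[-1] in "KR" and l[0] in "KR" and t[:-1] == l[1:]:
                pairs.append((t, l))
    return pairs


toy_protein = "MAGKTLSERPEPK"  # hypothetical sequence, for illustration only
print(digest(toy_protein, "trypsin"))      # ['MAGK', 'TLSER', 'PEPK']
print(digest(toy_protein, "lysarginase"))  # ['MAG', 'KTLSE', 'RPEP', 'K']
print(mirror_pairs(toy_protein))
```

In this toy case the pairs are `TLSER`/`KTLSE` and `PEPK`/`RPEP`: each pair covers the same core residues with the basic residue at opposite ends, which is what makes the two spectra complementary evidence for one stretch of sequence.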
In eukaryotes, alternative pre-mRNA splicing allows a single gene to encode different protein isoforms that function in many biological processes and that are used as biomarkers or therapeutic targets for diseases. Although protein isoforms in the human genome are well annotated, we speculate that some low-abundance protein isoforms may still be under-annotated, because most genes have a primary coding product and alternative protein isoforms tend to be under-expressed. A peptide coencoded by a novel exon and an annotated exon separated by an intron is known as a novel junction peptide. In the absence of known transcripts and homologous proteins, traditional proteogenomics based on whole-genome six-frame translation can neither identify novel junction peptides nor capture novel alternative splice sites. In this article, we first propose a strategy and tool for identifying novel junction peptides, called CJunction. We then integrate it into a proteogenomics process specifically designed for novel protein isoform discovery and apply it to the analysis of a deep-coverage HeLa mass spectrometry data set with identifier PXD004452 in ProteomeXchange. We identified and validated three novel protein isoforms of two functionally important genes, NHSL1 (causative gene of Nance-Horan syndrome) and EEF1B2 (translation elongation factor), validating our hypothesis. These novel protein isoforms differ significantly in sequence from the annotated gene-coding products because of their novel N-terminus, suggesting that they may have substantially different functions.
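Why six-frame translation misses junction peptides can be seen in a small toy sketch. This is not CJunction. The exon and intron sequences below are invented for illustration, and the helper names (`translate`, `six_frame`) are hypothetical. The sketch only shows that a peptide spanning a splice junction exists in the spliced transcript but in none of the six genomic reading frames.

```python
from itertools import product

# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON = {"".join(c): AMINO[i] for i, c in enumerate(product(BASES, repeat=3))}


def translate(dna):
    """Translate a DNA string codon by codon, dropping any trailing bases."""
    return "".join(CODON[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame(dna):
    """Return the three forward and three reverse-strand translations."""
    rc = reverse_complement(dna)
    return [translate(s[f:]) for s in (dna, rc) for f in range(3)]


# Hypothetical gene: the junction falls inside a lysine codon (A | AA).
exon1 = "ATGGCTGAACTGA"
intron = "GTAAGTTTTTTCAG"
exon2 = "AACGTTCTACTCCG"

genome = exon1 + intron + exon2
spliced = exon1 + exon2

junction_peptide = "AELKRST"
print(translate(spliced))                                   # MAELKRSTP
print(junction_peptide in translate(spliced))               # True
print(any(junction_peptide in f for f in six_frame(genome)))  # False
```

The genomic frames contain the flanking fragments on either side of the intron, but never the residues that straddle the splice site, so a database built only from six-frame translation cannot match a spectrum from the junction peptide.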
Theoretical advances of the structures and catalytic activities of small-sized gold nanoclusters
Alternative splicing allows a small number of human genes to encode a large number of proteoforms that play essential roles in normal and disease physiology. Some low-abundance proteoforms may remain undiscovered due to limited detection and analysis capabilities. Peptides coencoded by novel exons and annotated exons separated by introns are called novel junction peptides, and they are the key to identifying novel proteoforms. Traditional de novo sequencing does not take into account the specific composition of novel junction peptides and is therefore less accurate for them. We first developed a novel de novo sequencing algorithm, CNovo, which outperformed the mainstream tools PEAKS and Novor on all six test sets. We then built on CNovo to develop a semi-de novo sequencing algorithm, SpliceNovo, specifically for identifying novel junction peptides. SpliceNovo identifies junction peptides with much higher accuracy than CNovo, CJunction, PEAKS, and Novor. The built-in CNovo in SpliceNovo can also be replaced with other, more accurate de novo sequencing algorithms to further improve its performance. Using SpliceNovo, we also identified and validated two novel proteoforms of the human EIF4G1 and ELAVL1 genes. Our results significantly improve the ability to discover novel proteoforms through de novo sequencing.