Recent advances in next-generation DNA sequencing and proteomics provide an unprecedented ability to survey mRNA and protein abundances. Such proteome-wide surveys are illuminating the extent to which different aspects of gene expression help to regulate cellular protein abundances. Current data demonstrate a substantial role for regulatory processes occurring after mRNA is made — that is, post-transcriptional, translational and protein degradation regulation — in controlling steady-state protein abundances. Intriguing observations are also emerging in relation to cells following perturbation, single-cell studies and the apparent evolutionary conservation of protein and mRNA abundances. Here, we summarize current understanding of the major factors regulating protein expression.
We report a method for large-scale absolute protein expression measurements (APEX) and apply it to estimate the relative contributions of transcriptional- and translational-level gene regulation in the yeast and Escherichia coli proteomes. APEX relies upon correcting each protein's mass spectrometry sampling depth (observed peptide count) by learned probabilities for identifying the peptides. APEX abundances agree with measurements from controls, western blotting, flow cytometry and two-dimensional gels, as well as known correlations with mRNA abundances and codon bias, providing absolute protein concentrations across approximately three to four orders of magnitude. Using APEX, we demonstrate that 73% of the variance in yeast protein abundance (47% in E. coli) is explained by mRNA abundance, with the number of proteins per mRNA log-normally distributed about approximately 5,600 (approximately 540 in E. coli) protein molecules/mRNA. Therefore, levels of both eukaryotic and prokaryotic proteins are set per mRNA molecule and independently of overall protein concentration, with >70% of yeast gene expression regulation occurring through mRNA-directed mechanisms.
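The correction described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes that each protein comes with an observed peptide count and an expected count derived from learned peptide identification probabilities, and that the corrected values are scaled to an assumed total number of protein molecules per cell. The function name `apex_like_abundance`, the protein labels, and all numbers are hypothetical.

```python
def apex_like_abundance(observed_counts, expected_counts, total_molecules):
    """Correct observed peptide counts by expected counts, then normalize.

    observed_counts: dict mapping protein -> observed peptide count
    expected_counts: dict mapping protein -> expected count per unit abundance
                     (assumed to come from learned identification probabilities)
    total_molecules: assumed total protein molecules per cell (hypothetical)
    """
    corrected = {}
    for protein, observed in observed_counts.items():
        expected = expected_counts[protein]
        if expected <= 0:
            raise ValueError(f"expected count for {protein} must be positive")
        # A protein whose peptides are easy to identify is down-weighted.
        corrected[protein] = observed / expected

    total = sum(corrected.values())
    if total == 0:
        raise ValueError("no peptide observations to normalize")

    # Scale the relative values so that they sum to the assumed total.
    return {p: total_molecules * v / total for p, v in corrected.items()}


# Hypothetical input: A and B have equal observed counts, but A's peptides
# are twice as likely to be identified, so A is estimated as less abundant.
observed = {"A": 30, "B": 30, "C": 10}
expected = {"A": 6.0, "B": 3.0, "C": 1.0}
abundance = apex_like_abundance(observed, expected, total_molecules=1000.0)
```

The point of the sketch is that raw peptide counts alone would rank A and B equally; dividing by the expected count separates them before any normalization is applied.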
Cellular states are determined by differential expression of the cell’s proteins. The relationship between protein and mRNA expression levels informs about the combined outcomes of translation and protein degradation which are, in addition to transcription and mRNA stability, essential contributors to gene expression regulation. This review summarizes the state of knowledge about large-scale measurements of absolute protein and mRNA expression levels, and the degree of correlation between the two parameters. We summarize the information that can be derived from comparison of protein and mRNA expression levels and discuss how corresponding sequence characteristics suggest modes of regulation.
We provide a large-scale dataset on absolute protein and matching mRNA concentrations from the human medulloblastoma cell line Daoy. The correlation between mRNA and protein concentrations is significant and positive (Rs=0.46, R2=0.29, P-value<2e-16), although non-linear. Out of ∼200 tested sequence features, sequence length, frequency and properties of amino acids, as well as translation initiation-related features are the strongest individual correlates of protein abundance when accounting for variation in mRNA concentration. When integrating mRNA expression data and all sequence features into a non-parametric regression model (Multivariate Adaptive Regression Splines), we were able to explain up to 67% of the variation in protein concentrations. Half of the contributions were attributed to mRNA concentrations, the other half to sequence features relating to regulation of translation and protein degradation. The sequence features are primarily linked to the coding and 3′ untranslated region. To our knowledge, this is the most comprehensive predictive model of human protein concentrations achieved so far.
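The two correlation statistics quoted in this abstract (a Spearman rank correlation and an R2 value) can be computed as in the sketch below. It is an illustration only: it assumes that R2 is the squared Pearson correlation of log10-transformed concentrations, which the abstract does not state, and the mRNA and protein values are invented for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented concentrations (arbitrary units) for four genes.
mrna = [1.0, 10.0, 100.0, 1000.0]
protein = [50.0, 40.0, 9000.0, 200000.0]

rs = spearman(mrna, protein)
r2 = pearson([math.log10(v) for v in mrna],
             [math.log10(v) for v in protein]) ** 2
```

Because the rank correlation depends only on ordering, it is unaffected by the non-linearity the abstract mentions, whereas R2 on log-transformed values measures how well a straight line in log space fits the data.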
Plants do not grow as axenic organisms in nature, but host a diverse community of microorganisms, termed the plant microbiota. There is an increasing awareness that the plant microbiota plays a role in plant growth and can provide protection from invading pathogens. Apart from intense research on crop plants, Arabidopsis is emerging as a valuable model system to investigate the drivers shaping stable bacterial communities on leaves and roots and as a tool to decipher the intricate relationship among the host and its colonizing microorganisms. Gnotobiotic experimental systems help establish causal relationships between plant and microbiota genotypes and phenotypes and test hypotheses on biotic and abiotic perturbations in a systematic way. We highlight major recent findings in plant microbiota research using comparative community profiling and omics analyses, and discuss these approaches in light of community establishment and beneficial traits like nutrient acquisition and plant health.
Most proteins have been formed by gene duplication, recombination, and divergence. Proteins of known structure can be matched to about 50% of genome sequences, and these data provide a quantitative description and can suggest hypotheses about the origins of these processes.
SUPERFAMILY provides structural, functional and evolutionary information for proteins from all completely sequenced genomes, and large sequence collections such as UniProt. Protein domain assignments for over 900 genomes are included in the database, which can be accessed at http://supfam.org/. Hidden Markov models based on Structural Classification of Proteins (SCOP) domain definitions at the superfamily level are used to provide structural annotation. We recently produced a new model library based on SCOP 1.73. Family level assignments are also available. From the web site users can submit sequences for SCOP domain classification; search for keywords such as superfamilies, families, organism names, models and sequence identifiers; find over- and underrepresented families or superfamilies within a genome relative to other genomes or groups of genomes; compare domain architectures across selections of genomes; and finally build multiple sequence alignments between Protein Data Bank (PDB), genomic and custom sequences. Recent extensions to the database include InterPro abstracts and Gene Ontology terms for superfamilies, taxonomic visualization of the distribution of families across the tree of life, searches for functionally similar domain architectures and phylogenetic trees. The database, models and associated scripts are available for download from the ftp site.