Extant protein-coding sequences span a huge range of ages, from those that emerged only recently to those present in the last universal common ancestor. Because evolution has had less time to act on young sequences, there might be ‘phylostratigraphy’ trends in any properties that evolve slowly with age. A long-term reduction in hydrophobicity and hydrophobic clustering was found in previous, taxonomically restricted studies. Here we perform integrated phylostratigraphy across 435 fully sequenced species, using sensitive HMM methods to detect protein domain homology. We find that the reduction in hydrophobic clustering is universal across lineages. However, only young animal domains tend to have higher structural disorder. Among ancient domains, trends in amino acid composition reflect the order of recruitment into the genetic code, suggesting that the composition of the contemporary descendants of ancient sequences reflects amino acid availability during the earliest stages of life, when these sequences first emerged.
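One way to make "hydrophobic clustering" concrete is a block-based dispersion ratio. The sketch below is an illustrative simplification and not necessarily the metric used in the study: the residue set `HYDROPHOBIC`, the block size of 6, and the binomial expectation are all assumptions chosen for the example. Values above 1 suggest that hydrophobic residues are more clustered than random placement would give, and values below 1 suggest that they are more evenly spread.

```python
from statistics import pvariance

# Assumed set of hydrophobic residues; other definitions exist.
HYDROPHOBIC = set("LIVFMW")


def hydrophobic_dispersion(seq, block=6):
    """Observed / binomially expected variance of hydrophobic counts per block."""
    seq = seq.upper()
    n_blocks = len(seq) // block
    if n_blocks < 2:
        raise ValueError("need at least two full blocks")
    used = seq[: n_blocks * block]  # drop any trailing partial block
    counts = [
        sum(res in HYDROPHOBIC for res in used[i : i + block])
        for i in range(0, len(used), block)
    ]
    p = sum(counts) / len(used)  # overall hydrophobic fraction
    if p in (0.0, 1.0):
        raise ValueError("undefined unless both residue classes are present")
    expected = block * p * (1 - p)  # binomial variance under random placement
    return pvariance(counts) / expected


# Hypothetical toy sequences with identical composition but different ordering.
clustered = "LLLLLLSSSSSS" * 2  # hydrophobic residues in long runs
alternating = "LSLSLS" * 4      # hydrophobic residues evenly spaced
print(hydrophobic_dispersion(clustered), hydrophobic_dispersion(alternating))
```

Because both toy sequences have the same amino acid composition, any difference in the ratio comes from residue ordering alone, which is the sense in which clustering is distinct from overall hydrophobicity.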
Protein-coding sequences can arise either from duplication and divergence of existing sequences, or de novo from noncoding DNA. Unfortunately, recently evolved de novo genes can be hard to distinguish from false positives, making their study difficult. Here, we study a more tractable version of the process of conversion of noncoding sequence into coding: the co-option of short segments of noncoding sequence into the C-termini of existing proteins via the loss of a stop codon. Because we study recent additions to potentially old genes, we are able to apply a variety of stringent quality filters to our annotations of what is a true protein-coding gene, discarding the putative proteins of unknown function that are typical of recent fully de novo genes. We identify 54 examples of C-terminal extensions in Saccharomyces and 28 in Drosophila, all of them recent enough to still be polymorphic. We find one putative gene fusion that turns out, on close inspection, to be the product of replicated assembly errors, further highlighting the issue of false positives in the study of rare events. Four of the Saccharomyces C-terminal extensions (to ADH1, ARP8, TPM2, and PIS1) that survived our quality filters are predicted to lead to significant modification of a protein domain structure.
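The consequence of losing a stop codon can be sketched in a few lines: translation continues in frame through the former 3ʹ UTR until the next stop codon. The example below is a minimal illustration under stated assumptions, not the pipeline used in the study. The function name `c_terminal_extension` and the toy sequences are hypothetical, the standard genetic code is assumed, and the residue encoded by the mutated stop codon itself is left out because it depends on the specific mutation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def c_terminal_extension(utr3):
    """Translate DNA that starts right after a lost stop codon.

    Returns the peptide read in frame up to the next stop codon, or None if
    no in-frame stop is found or an unrecognized codon is encountered.
    """
    utr3 = utr3.upper()
    peptide = []
    for i in range(0, len(utr3) - 2, 3):
        aa = CODON_TABLE.get(utr3[i : i + 3])
        if aa is None:  # e.g. a codon containing N
            return None
        if aa == "*":   # next in-frame stop ends the extension
            return "".join(peptide)
        peptide.append(aa)
    return None  # ran off the end without reaching a stop codon


# Hypothetical 3' UTR: GCT GGT TAA GCC translates as Ala, Gly, stop.
print(c_terminal_extension("GCTGGTTAAGCC"))
```

Returning None when no in-frame stop is found keeps such cases separate from genuine zero-length extensions, which return an empty string.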
Proteins are the workhorses of the cell, yet they carry great potential for harm via misfolding and aggregation. Despite the dangers, proteins are sometimes born de novo from non-coding DNA. Proteins are more likely to be born from non-coding regions that produce peptides that do little to no harm when translated than from regions that produce harmful peptides. To investigate which newborn proteins are most likely to “first, do no harm”, we estimate fitnesses from an experiment that competed Escherichia coli lineages that each expressed a unique random peptide. A variety of peptide metrics significantly predict lineage fitness, but this predictive power stems from simple amino acid frequencies rather than the ordering of amino acids. Amino acids that are smaller and that promote intrinsic structural disorder have more benign fitness effects. We validate that the amino acids that indicate benign effects in random peptides expressed in E. coli also do so in an independent dataset of random N-terminal tags in which it is possible to control for expression level. The same amino acids are also enriched in young animal proteins.
Proteins' great potential for harm, via misfolding and aggregation, does not prevent them from being born de novo from non-coding DNA. To investigate how newborn proteins might "first, do no harm", we estimate fitnesses from an experiment that competed Escherichia coli lineages that each expressed a unique random peptide. A variety of peptide metrics significantly predict lineage fitness, but almost all this predictive power stems from simple amino acid composition. Low fitness is predicted by hydrophobic and positively charged residues, positive net charge, aggregation-prone regions, and underdispersed hydrophobic residues, while high fitness is predicted by disorder-promoting, negatively charged, small residues, and negative net charge. The same amino acids that predict high fitness in E. coli are enriched in young Pfams in animals, but not in plants. To modify Jacques Monod's famous quote, what makes peptides benign in E. coli also makes them benign in elephants, but not in eucalyptus.
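Two of the predictors named above, amino acid composition and net charge, are simple enough to sketch directly. The code below is a minimal illustration and not the authors' implementation: the function names and the example peptide are hypothetical, and net charge is approximated by counting K and R as +1 and D and E as -1, ignoring histidine and the termini.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(peptide):
    """Fraction of each of the 20 standard amino acids in the peptide."""
    counts = Counter(peptide.upper())
    return {aa: counts[aa] / len(peptide) for aa in AMINO_ACIDS}


def net_charge(peptide):
    """Approximate net charge: (K + R) - (D + E). Assumed simplification."""
    counts = Counter(peptide.upper())
    return counts["K"] + counts["R"] - counts["D"] - counts["E"]


peptide = "MKRDEGSAAL"  # hypothetical example peptide
shuffled = "".join(random.sample(peptide, len(peptide)))

# Composition and net charge depend only on which residues are present,
# so shuffling the residue order leaves both unchanged.
print(composition(peptide) == composition(shuffled))
print(net_charge(peptide), net_charge(shuffled))
```

Metrics of this kind carry no information about residue order, which is why they can be compared against order-dependent metrics such as aggregation-prone regions or hydrophobic dispersion.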
Supplemental figures
Supplemental fig. 1. Fully backed up and not fully backed up proteins exhibit substantial overlap in their distributions of 3ʹ UTR lengths, protein abundances, and extension lengths.
Supplemental fig. 2. +2 shifted extensions with more 3ʹ UTR ribohits have higher ISD. A) Linear (red) and loess (blue) regressions of square-root transformed extension ISD on log-ribohits (without controlling for length). All regressions are weighted by the length of the extensions, visually represented by point area. The fact that after grafting (right) the curve persists with a similar slope, about 50% as steep, shows that elevated ISD in genes with more readthrough is at least partly driven by amino acids beyond the stop codon. Weighted R² = 0.0029 and 0.00079 (Willett and Singer 1988), and in the corresponding unweighted analyses R² = 0.0043 and 0.00053, for original and grafted, respectively. B) Ribohits still predict higher ISD (P = 2 × 10⁻⁵ and 5 × 10⁻⁴ for +2 shifted original and grafted extensions, respectively, +2 frame ISD models, supplemental Table 1) after controlling for the large effect of extension length (P = 2 × 10⁻⁵⁷ and 6 × 10⁻¹²⁰). "Median" is the line for either the original (left) or grafted (right) +2 shifted ISD model in supplemental Table 1 when there are 13 3ʹ UTR ribohits, which is the median value for fully backed up proteins (see fig. 2), and "None" is for the same model with zero ribohits. Like in fig. 5, weighting by the length of the extensions (see Materials and Methods) means that the ISD values represent expectations from sampling an amino acid from the extensions rather than from sampling an extension.
Supplemental fig. 3. +1 shifted extension ISD does not display evidence of pre-adapting selection. A) Linear (red) and loess (blue) regressions of square-root transformed extension ISD on log-ribohits without controlling for length are shown. Both regressions are weighted by the length of the extensions, visually represented by point area. B) After controlling for extension length, log ribohits are not a significant predictor of +1 extension ISD (P > 0.1, +1 shifted original extension ISD model, Table 2). Ribohits and extension length are log transformed, while +1 extension ISD is square-root transformed. Lines for "Median" and "None" are as described in supplemental figure 2, except using the +1 shifted extension ISD model in supplemental Table 2. Note that grafted extensions are not shown for +1 shifted extensions; because the non-grafted (i.e. original) +1 shifted extension ISD do not show a significant effect with ribohits, there was no need to show whether a non-significant effect persisted after controlling for the effects of C-terminal amino acids.
Supplemental fig. 4. Unweighted regressions for in-frame extension ISD on ribohits do not display the pronounced downturn past 100 ribohits seen in Fig. 5A, indicating that the downturn is due to weighting by extension length. A) In-frame and B) +2 frame regressions are the same as in fig. 5A and supplemental fig. 2A, resp...