Trace quantities of contaminating DNA are widespread in the laboratory environment, but their presence has received little attention in the context of high throughput sequencing. This issue is highlighted by recent works that have rested controversial claims upon sequencing data that appear to support the presence of unexpected exogenous species. I used reads that preferentially aligned to alternate genomes to infer the distribution of potential contaminant species in a set of independent sequencing experiments. I confirmed that dilute samples are more exposed to contaminating DNA, and, focusing on four single-cell sequencing experiments, found that these contaminants appear to originate from a wide diversity of clades. Although negative control libraries prepared from ‘blank’ samples recovered the highest-frequency contaminants, low-frequency contaminants, which appeared to make heterogeneous contributions to samples prepared in parallel within a single experiment, were not well controlled for. I used these results to show that, despite heavy replication and plausible controls, contamination can explain all of the observations used to support a recent claim that complete genes pass from food to human blood. Contamination must be considered a potential source of signals of exogenous species in sequencing data, even if these signals are replicated in independent experiments, vary across conditions, or indicate a species which seems a priori unlikely to contaminate. Negative control libraries processed in parallel are essential to control for contaminant DNAs, but their limited ability to recover low-frequency contaminants must be recognized.
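The read-assignment idea described above can be illustrated with a minimal sketch. It assumes, hypothetically, that each read already has an alignment score against the target genome and against one or more alternate genomes; the function names, the score dictionary layout, and the `margin` parameter are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def preferential_assignments(read_scores, target, margin=0):
    """Count reads that align better to an alternate genome than to the target.

    read_scores: dict mapping read id -> dict mapping genome name -> alignment
    score (higher is better). This layout is an assumption for illustration.
    margin: how much better the alternate score must be (hypothetical knob).
    """
    counts = Counter()
    for scores in read_scores.values():
        target_score = scores.get(target, float("-inf"))
        alternates = {g: s for g, s in scores.items() if g != target}
        if not alternates:
            continue  # read aligned only to the target genome
        best_genome, best_score = max(alternates.items(), key=lambda kv: kv[1])
        if best_score > target_score + margin:
            counts[best_genome] += 1
    return counts


def contaminant_fractions(counts, total_reads):
    """Convert per-genome read counts into fractions of all reads."""
    return {genome: n / total_reads for genome, n in counts.items()}


# Toy data: genome names and scores are invented for the example.
reads = {
    "r1": {"human": 60, "ecoli": 10},
    "r2": {"human": 5, "ecoli": 55},
    "r3": {"human": 50},
    "r4": {"human": 20, "yeast": 40, "ecoli": 30},
}
counts = preferential_assignments(reads, target="human")
fractions = contaminant_fractions(counts, total_reads=len(reads))
```

Comparing such per-genome fractions between a sample and a blank library prepared in parallel is one way to see which candidate contaminants the negative control recovers and which it misses.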
Although the evolutionary significance of gene duplication has long been recognized, it remains unclear what determines gene duplicability. We find protein complexity to be an important determinant because the proportion of unduplicated genes (P) increases with the number of subunits in a protein. However, P is high (>65%) for both monomers and multimers in yeast, but <30% in human except for subunits of large multimers, implying that organismal complexity is a stronger determinant of gene duplicability than protein complexity. The same conclusion is reached from a comparison of family sizes in yeast and human. Despite >30 years of effort (1), it remains unclear what determines gene duplicability. Protein complexity, defined as the number of subunits in a protein (n), might be an important factor because duplication of a protein subunit may cause dosage imbalance among the subunits of the protein (2, 3) and the chance of imbalance might increase with the number of subunits in a protein. By using yeast data, Papp et al. (3) found that 33% of the single-copy genes (singletons) participate in protein complexes (multimers), whereas this frequency drops to ≈21% for genes with three or more paralogues. They therefore concluded that duplication of a subunit of a protein complex is less likely to be successful than duplication of a monomer. However, no monomers were included in their analysis, so the magnitude of difference in survivability between duplication of a monomer and duplication of a protein complex subunit is not known. It is worth emphasizing that duplication of a monomer may also cause dosage imbalance. This may be particularly true for transcription factors, each of which may control many downstream genes. For example, Drosophila embryos produced by mothers with four dosages of bicoid, a maternal morphogen, tend to develop a larger head, and only ≈30% of the embryos produced by mothers with six dosages of bicoid are viable (4).
Thus, it is important to include monomers. Indeed, we study the relationship between the survivability of a gene duplication and n by classifying proteins into monomers (n = 1), dimers (n = 2), midsize complexes (3 ≤ n ≤ 10), and large complexes (n > 10). Another factor that may affect the survivability of duplicate genes is organismal complexity. It was suggested that, for transcription factors, dosage imbalance occurs more frequently in a complex organism than in yeast because of the long regulatory cascades during multicellular development (3). However, a complex organism may actually be more robust against dosage increase than a simple organism (see below). Thus, we also examine this factor by contrasting human with yeast. Here, organismal complexity is loosely defined as the number of different types of cells. Previously we talked about survivability, which may be defined as the probability for a duplicate gene to survive, but adaptive evolution of duplicate genes may also be important. Because, in the end, we see only whether a gene has been duplicated or not, we will use gene...
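The classification by subunit number and the proportion of unduplicated genes (P) can be sketched as follows. This is a minimal illustration, assuming each gene is given as a hypothetical pair (number of subunits n, number of paralogues) and treating a gene with zero paralogues as unduplicated; the function names and input layout are assumptions, not the authors' pipeline.

```python
from collections import defaultdict


def complexity_class(n):
    """Map subunit number n to the four classes named in the text."""
    if n < 1:
        raise ValueError("a protein has at least one subunit")
    if n == 1:
        return "monomer"
    if n == 2:
        return "dimer"
    if n <= 10:
        return "midsize complex"  # 3 <= n <= 10
    return "large complex"        # n > 10


def unduplicated_proportion(genes):
    """Return P per class: the fraction of genes with no paralogues.

    genes: iterable of (n_subunits, n_paralogues) pairs (assumed layout).
    """
    totals = defaultdict(int)
    singletons = defaultdict(int)
    for n_subunits, n_paralogues in genes:
        cls = complexity_class(n_subunits)
        totals[cls] += 1
        if n_paralogues == 0:
            singletons[cls] += 1
    return {cls: singletons[cls] / totals[cls] for cls in totals}


# Invented toy data, for illustration only.
toy_genes = [(1, 0), (1, 2), (2, 0), (2, 0), (5, 1), (12, 0)]
p_by_class = unduplicated_proportion(toy_genes)
```

Running the same tabulation separately on yeast and human gene sets would give the per-class P values that the comparison in the text relies on.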
The clustering of transcription factor binding sites in developmental enhancers and the apparent preferential conservation of clustered sites have been widely interpreted as proof that spatially constrained physical interactions between transcription factors are required for regulatory function. However, we show here that selection on the composition of enhancers alone, and not their internal structure, leads to the accumulation of clustered sites with evolutionary dynamics that suggest they are preferentially conserved. We simulated the evolution of idealized enhancers from Drosophila melanogaster constrained to contain only a minimum number of binding sites for one or more factors. Under this constraint, mutations that destroy an existing binding site are tolerated only if a compensating site has emerged elsewhere in the enhancer. Overlapping sites, such as those frequently observed for the activator Bicoid and repressor Krüppel, had significantly longer evolutionary half-lives than isolated sites for the same factors. This leads to a substantially higher density of overlapping sites than expected by chance and the appearance that such sites are preferentially conserved. Because D. melanogaster (like many other species) has a bias for deletions over insertions, sites tended to become closer together over time, leading to an overall clustering of sites in the absence of any selection for clustered sites. Since this effect is strongest for the oldest sites, clustered sites also incorrectly appear to be preferentially conserved. Following speciation, sites tend to be closer together in all descendent species than in their common ancestors, violating the common assumption that shared features of species' genomes reflect their ancestral state. Finally, we show that selection on binding site composition alone recapitulates the observed number of overlapping and closely neighboring sites in real D. melanogaster enhancers. 
Thus, this study calls into question the common practice of inferring “cis-regulatory grammars” from the organization and evolutionary dynamics of developmental enhancers.
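The compositional constraint described above, under which a mutation is tolerated only if the enhancer still contains a minimum number of binding sites, can be sketched in a few lines. This is a deliberately simplified illustration and not the authors' simulation: it assumes a binding site is an exact match to a single hypothetical motif string (real models typically score sites with position weight matrices), and it applies point substitutions only, so the deletion bias discussed in the text is not modelled.

```python
import random


def count_sites(seq, motif):
    """Count exact, possibly overlapping, occurrences of motif in seq."""
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)


def evolve(seq, motif, min_sites, steps, rng):
    """Apply random point substitutions, rejecting any that drop the
    number of motif matches below min_sites (composition-only selection)."""
    bases = list(seq)
    for _ in range(steps):
        pos = rng.randrange(len(bases))
        old = bases[pos]
        bases[pos] = rng.choice([b for b in "ACGT" if b != old])
        if count_sites("".join(bases), motif) < min_sites:
            bases[pos] = old  # mutation not tolerated: revert it
    return "".join(bases)


# Toy run: the motif and starting sequence are invented for the example.
rng = random.Random(0)
motif = "TAATCC"
start = "ACGT" * 5 + motif + "GGCA" * 5 + motif + "TTGA" * 5
evolved = evolve(start, motif, min_sites=2, steps=2000, rng=rng)
```

Because selection here acts only on the site count, a site can be lost only after a compensating match has appeared elsewhere, which is the turnover mechanism the abstract describes.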