We compare and contrast genome-wide compositional biases and distributions of short oligonucleotides across 15 diverse prokaryotes that have substantial genomic sequence collections. These include seven complete genomes (Escherichia coli, Haemophilus influenzae, Mycoplasma genitalium, Mycoplasma pneumoniae, Synechocystis sp. strain PCC6803, Methanococcus jannaschii, and Pyrobaculum aerophilum). A key observation concerns the constancy of the dinucleotide relative abundance profiles over multiple 50-kb disjoint contigs within the same genome. (The profile is $\rho^*_{XY} = f^*_{XY} / (f^*_X f^*_Y)$ for all XY, where $f^*_X$ denotes the frequency of the nucleotide X and $f^*_{XY}$ denotes the frequency of the dinucleotide XY, both computed from the sequence concatenated with its inverted complementary sequence.) On the basis of this constancy, we refer to the collection $\{\rho^*_{XY}\}$ as the genome signature. We establish that the differences between $\{\rho^*_{XY}\}$ vectors of 50-kb sample contigs of different genomes virtually always exceed the differences between those of the same genomes. Various di- and tetranucleotide biases are identified. In particular, we find that the dinucleotide CpG = CG is underrepresented in many thermophiles (e.g., M. jannaschii, Sulfolobus sp., and M. thermoautotrophicum) but overrepresented in halobacteria. TA is broadly underrepresented in prokaryotes and eukaryotes, but normal counts appear in Sulfolobus and P. aerophilum sequences. More than for any other bacterial genome, palindromic tetranucleotides are underrepresented in H. influenzae. The M. jannaschii sequence is unprecedented in its extreme underrepresentation of CTAG tetranucleotides and in the anomalous distribution of CTAG sites around the genome. Comparative analysis of numbers of long tetranucleotide microsatellites distinguishes H. influenzae. Dinucleotide relative abundance differences between bacterial sequences are compared.
For example, in these assessments of differences, the cyanobacteria Synechocystis, Synechococcus, and Anabaena do not form a coherent group and are as far from each other as general gram-negative sequences are from general gram-positive sequences. The difference of M. jannaschii from low-G+C gram-positive proteobacteria is one-half of the difference from gram-negative proteobacteria. Interpretations and hypotheses center on the role of the genome signature in highlighting similarities and dissimilarities across different classes of prokaryotic species, possible mechanisms underlying the genome signature, the form and level of genome compositional flux, the use of the genome signature as a chronometer of molecular phylogeny, and implications with respect to the three putative eubacterial, archaeal, and eukaryote domains of life and to the origin and early evolution of eukaryotes.

In this report, we describe measures of genomic similarities that do not depend on prior alignment of homologous sequences and apply them to sufficiently large samples of prokaryotic genomic sequences. The approach departs from almost all other metho...
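The profile defined in the abstract above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names (`reverse_complement`, `signature`, `signature_difference`) are hypothetical. Counts are pooled over the sequence and its inverted complement rather than taken from a literal concatenation, which avoids counting one spurious dinucleotide at the junction. The abstract does not define the difference measure; the mean absolute difference over the 16 dinucleotides used here is an assumption.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the inverted complementary sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def signature(seq):
    """Return {XY: rho*_XY}, with counts pooled over both strands."""
    seq = seq.upper()
    mono, di = Counter(), Counter()
    for strand in (seq, reverse_complement(seq)):
        mono.update(strand)
        di.update(strand[i:i + 2] for i in range(len(strand) - 1))
    # Normalize over unambiguous bases only, so characters such as N are ignored.
    n_mono = sum(mono[b] for b in BASES)
    n_di = sum(di[x + y] for x, y in product(BASES, repeat=2))
    profile = {}
    for x, y in product(BASES, repeat=2):
        fx, fy = mono[x] / n_mono, mono[y] / n_mono
        fxy = di[x + y] / n_di
        # rho* is undefined when a base never occurs; report NaN in that case.
        profile[x + y] = fxy / (fx * fy) if fx and fy else float("nan")
    return profile


def signature_difference(seq_a, seq_b):
    """Assumed metric: mean absolute difference between two profiles."""
    pa, pb = signature(seq_a), signature(seq_b)
    return sum(abs(pa[k] - pb[k]) for k in pa) / len(pa)
```

Because both strands are pooled, the profile is strand-symmetric: a dinucleotide and its reverse complement (for example CA and TG) always receive the same value, so only 10 of the 16 entries are distinct.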
This work assesses relationships for 30 complete prokaryotic genomes between the presence of the Shine-Dalgarno (SD) sequence and other gene features, including expression levels, type of start codon, and distance between successive genes. A significant positive correlation was found between the presence of an SD sequence and the predicted expression level of a gene based on codon usage biases, such that predicted highly expressed genes are more likely to possess a strong SD sequence than average genes. Genes with AUG start codons are more likely than genes with other start codons, GUG or UUG, to possess an SD sequence. In most genomes, genes in close proximity to upstream genes on the same coding strand are significantly more likely to possess an SD sequence. In light of these results, we discuss the role of the SD sequence in translation initiation and its relationship with predicted gene expression levels and with operon structure in both bacterial and archaeal genomes.
Strand-symmetric relative abundance functionals for di-, tri-, and tetranucleotides are introduced and applied to sequences encompassing a broad phylogenetic range to discern tendencies and anomalies in the occurrences of these short oligonucleotides within and between genomic sequences. For dinucleotides, TA is almost universally under-represented, with the exception of vertebrate mitochondrial genomes, and CG is strongly under-represented in vertebrates and in mitochondrial genomes. The traditional methylation/deamination/mutation hypothesis for the rarity of CG does not adequately account for the observed deficiencies in certain sequences, notably the mitochondrial genomes, yeast, and Neurospora crassa, which lack the standard CpG methylase. Homodinucleotides (AA·TT, CC·GG) and larger homooligonucleotides are over-represented in many organisms, perhaps due to polymerase slippage events. For trinucleotides, GCA·TGC tends to be under-represented in phage, human viral, and eukaryotic sequences, and CTA·TAG is strongly under-represented in many prokaryotic, eukaryotic, and viral sequences. The CCA·TGG triplet is ubiquitously over-represented in human viral and eukaryotic sequences. Among the tetranucleotides, several four-base-pair palindromes tend to be under-represented in phage sequences, probably as a means of restriction avoidance. The tetranucleotide CTAG is observed to be rare in virtually all bacterial genomes and some phage genomes. Explanations for these over- and under-representations in terms of DNA/RNA structures and regulatory mechanisms are considered.

The DNA sequences examined (Table 1) range from a low G+C frequency of 33% in yeast up to 69% for the bacterium Streptomyces lividans. The relative abundance functionals control for these biases. The CG doublet in vertebrate sequences is a paradigm case of significant under-representation (CpG suppression). It is also well known that TA is under-utilized in the DNA of most organisms. For previous tabulations and analyses of doublet relative abundances, see, e.g., refs. 4-8. The rarity of certain tetranucleotides (e.g., the DAM methylase site GATC and the tetranucleotide CTAG) in some enterobacterial species was highlighted in refs. 9 and 10. In this paper we commence a detailed study encompassing a broad phylogenetic range with the aim of discerning tendencies and anomalies in the occurrences of di-, tri-, and tetranucleotides within and between genomic sequences. In particular, we identify extremes of over- and under-representation of short oligonucleotides. Assessments of dinucleotide relative abundance are usually based on an odds ratio measure, where values sufficiently less than 1 (or >1) indicate that a given dinucleotide is under-represented (over-represented) compared with the random union of its mononucleotide components. Similar evaluations are available for characterizing the relative abundances of tri-, tetra-, and higher-order oligonucleotides (see Methods).

METHODS
Measures of Over-/Under-representation of Short Oligonucleotides. Let $f_X$ denote the...
We review concepts and methods for comparative analysis of complete genomes including assessments of genomic compositional contrasts based on dinucleotide and tetranucleotide relative abundance values, identifications of rare and frequent oligonucleotides, evaluations and interpretations of codon biases in several large prokaryotic genomes, and characterizations of compositional asymmetry between the two DNA strands in certain bacterial genomes. The discussion also covers means for identifying alien (e.g. laterally transferred) genes and detecting potential specialization islands in bacterial genomes.
Our basic observation is that each genome has a characteristic "signature" defined as the ratios between the observed dinucleotide frequencies and the frequencies expected if neighbors were chosen at random (dinucleotide relative abundances). The remarkable fact is that the signature is relatively constant throughout the genome; i.e., the patterns and levels of dinucleotide relative abundances of every 50-kb segment of the genome are about the same. Comparison of the signatures of different genomes provides a measure of similarity which has the advantage that it looks at all the DNA of an organism and does not depend on the ability to align homologous sequences of specific genes. Genome signature comparisons show that plasmids, both specialized and broad-range, and their hosts have substantially compatible (similar) genome signatures. Mammalian mitochondrial (Mt) genomes are very similar, and animal and fungal Mt are generally moderately similar, but they diverge significantly from plant and protist Mt sets. Moreover, Mt genome signature differences between species parallel the corresponding nuclear genome signature differences, despite large differences between Mt and host nuclear signatures. In signature terms, we find that the archaea are not a coherent clade. For example, Sulfolobus and Halobacterium are extremely divergent. There is no consistent pattern of signature differences among thermophiles. More generally, grouping prokaryotes by environmental criteria (e.g., habitat propensities, osmolarity tolerance, chemical conditions) reveals no correlations in genome signature.
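The constancy claim above suggests a simple check: cut each genome into disjoint windows, compute a signature per window, and compare within-genome differences with between-genome differences. The sketch below is one way to set that up, under stated assumptions. The names `rho`, `windows`, `mean_abs_diff`, and `within_and_between` are hypothetical, the 50-kb default follows the text, and the mean absolute difference is an assumed comparison measure. It also assumes uppercase input and that every window contains all four bases; otherwise the division in `rho` fails.

```python
from collections import Counter

BASES = "ACGT"
RC = str.maketrans(BASES, BASES[::-1])  # A<->T, C<->G


def rho(seq):
    """Strand-pooled dinucleotide relative abundances as a 16-element list."""
    strands = (seq, seq.translate(RC)[::-1])
    mono = Counter(ch for s in strands for ch in s)
    di = Counter(s[i:i + 2] for s in strands for i in range(len(s) - 1))
    n1 = sum(mono[b] for b in BASES)
    n2 = sum(di[x + y] for x in BASES for y in BASES)
    return [(di[x + y] / n2) / ((mono[x] / n1) * (mono[y] / n1))
            for x in BASES for y in BASES]


def windows(genome, size=50_000):
    """Split a genome into disjoint full-length windows, dropping any remainder."""
    return [genome[i:i + size] for i in range(0, len(genome) - size + 1, size)]


def mean_abs_diff(p, q):
    """Assumed comparison measure between two signatures."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


def within_and_between(genome_a, genome_b, size=50_000):
    """Return (within-genome differences, between-genome differences)."""
    sig_a = [rho(w) for w in windows(genome_a, size)]
    sig_b = [rho(w) for w in windows(genome_b, size)]
    within = [mean_abs_diff(p, q)
              for sigs in (sig_a, sig_b)
              for i, p in enumerate(sigs)
              for q in sigs[i + 1:]]
    between = [mean_abs_diff(p, q) for p in sig_a for q in sig_b]
    return within, between
```

If the signature is constant within a genome, the values in `within` should be small relative to those in `between`. Real genomic sequences would need ambiguous bases filtered out before calling `rho`.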