Shannon information in the genomes of all completely sequenced prokaryotes and eukaryotes is measured in word lengths of two to ten letters. It is found that, in a scale-dependent way, the Shannon information in complete genomes is much greater than that in matching random sequences: thousands of times greater in the case of short words. Furthermore, with the exception of the 14 chromosomes of Plasmodium falciparum, the Shannon information in all available complete genomes belongs to a universality class given by an extremely simple formula. The data are consistent with a model for genome growth composed of two main ingredients: random segmental duplications that increase the Shannon information in a scale-independent way, and random point mutations that preferentially reduce the larger-scale Shannon information. The inference drawn from the present study is that the large-scale, coarse-grained growth of genomes was selectively neutral, which suggests an independent corroboration of Kimura's neutral theory of evolution.
Shannon information (SI) and its special case, divergence, are defined for a DNA sequence in terms of probabilities of chemical words in the sequence and are computed for a set of complete genomes highly diverse in length and composition. We find the following: SI (but not divergence) is inversely proportional to sequence length for a random sequence but is length-independent for genomes; the genomic SI is always greater and, for shorter words and longer sequences, hundreds to thousands of times greater than the SI in a random sequence whose length and composition match those of the genome; genomic SIs appear to have word-length dependent universal values. The universality is inferred to be an evolutionary footprint of a universal mode for genome growth.

PACS numbers: 87.10.+e, 89.70.+c, 87.14.Gg, 87.23.Kg

Shannon entropy [1] has been used in almost every field concerned with information, including the analysis of DNA and protein sequences [2,3,4,5]. In the field of comparative genomics, however, it seems not to have found any systematic application. The high heterogeneity of complete genomes in length, base composition and percentage of coding regions may make comparison based on Shannon entropy problematic. Here we show that, by a simple and appropriate definition of a quantity we call Shannon information (SI), the difficulties associated with these issues are surmounted. The SI is applied to characterize the distribution of occurrence frequencies (FD) of k-mers, or words of k chemical letters, in complete genomes. We present a simple relation between the SI and the relative spectral width of an FD, a relation that furnishes an intuitive understanding of the connection between SI and information in a sequence. We show that, in spite of their high heterogeneity, the SIs in complete genomes can be represented by a set of genome-independent universal lengths.

Divergence and Shannon information.
Consider a set of occurrence frequencies, $F = \{f_i \mid i = 1, \ldots, \tau\}$ with $\sum_i f_i = L$, and the associated Shannon entropy $H = -\sum_i (f_i/L) \ln (f_i/L)$. The quantity attains its maximum value $H_{\max} = \ln \tau$ when all $f_i$ are equal to the mean frequency $\bar{f} = \tau^{-1} L$. In $H$, Shannon was concerned with the fidelity of messages as they are transported through communication devices. Here we are interested in the information in $F$ itself. There is a general notion that information in a system increases with a decrease in uncertainty, hence we identify the quantity $H_{\max} - H$, called divergence by Gatlin, as a measure of that information. When $F$ is partitioned into subsets $F_m$, the divergence of a subset is defined in the same way, with the summation restricted to those $f_i$ in the subset $F_m$. We define the SI carried by $F$ to be the weighted average of the divergences in the subsets.

Frequency distribution in a DNA sequence. We view a single strand of DNA as a linear text written in the four chemical letters A, C, G and T, representing the four kinds of nucleotides. Empirically, genomes are invariably within a few percent of being compositionally self-complementary and, for the present study, it suffices to characterize the base composition of a genome by a single number, p, the combined probability of (A+T). From now on the term profile of a sequence will refer to the p value and the length L of the se...
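The divergence defined above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `kmer_counts` and `divergence` are hypothetical, the sequences are synthetic, and k-mers are counted with an overlapping sliding window on a linear sequence, so the frequencies sum to L - k + 1 rather than exactly L. The subset-weighted SI is not computed here, because the partition into subsets and the weights are not given in the text.

```python
import math
import random
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers along a linear, single-strand sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def divergence(seq, k, alphabet="ACGT"):
    """Return H_max - H (in nats) for the k-mer frequency set of seq."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())          # L - k + 1 windows (assumption: no wrap-around)
    tau = len(alphabet) ** k              # number of possible k-mers
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.log(tau) - entropy        # k-mers that never occur contribute 0 to H


# Synthetic illustration only: a uniform random sequence versus a tandem repeat.
rng = random.Random(0)
random_seq = "".join(rng.choice("ACGT") for _ in range(20000))
repeat_seq = "ACGTTGCA" * 2500

print("random:", divergence(random_seq, 2))
print("repeat:", divergence(repeat_seq, 2))
```

A sequence whose k-mer frequencies are nearly equal has a divergence close to zero, while a highly repetitive sequence that uses only a few k-mers has a divergence approaching $\ln \tau$.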
This information has not been peer-reviewed. Responsibility for the findings rests solely with the author(s).

Universality in Large-Scale Structure of Complete Genomes

Abstract

The abundance of duplications in genomes in the form of paralogs, pseudogenes and a variety of repeats suggests that genomes may have used duplications as one mode for their growth. However, systematic knowledge of all possible duplications in whole genomes is still lacking. This paper reports the results of a detailed study of occurrence frequencies of short oligonucleotides in all extant complete genomes. We found a systematic pattern of repeats of short oligonucleotides that places all the complete genomes except Plasmodium in a single universality class expressed by an extremely simple formula. Our analysis of the data, combined with computer simulation of genome growth models, suggests a simple coarse-grained representation of genome growth: the ancestors of the genomes began to grow when they were no greater than 300 b in length, via a mechanism whose main components were neutral stochastic segmental replicative translocations and random small mutations.
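The growth mechanism described in this abstract can be sketched as a toy simulation. This is only an illustration under stated assumptions, not the authors' model: the function `grow` and its parameters `max_segment` and `mutations_per_dup` are hypothetical, and only the 300-base starting length comes from the text. A random seed sequence is extended by copying a randomly chosen segment to a random position, with occasional point mutations, and its 2-mer divergence ($H_{\max} - H$) is compared with that of a random sequence of the same length.

```python
import math
import random
from collections import Counter


def divergence(seq, k):
    """H_max - H (in nats) of the overlapping k-mer frequencies of seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return k * math.log(4) - entropy


def grow(seed_len=300, target_len=20000, max_segment=200,
         mutations_per_dup=1, rng=None):
    """Grow a sequence by random segmental duplication plus point mutation.

    All parameter values except seed_len are illustrative assumptions.
    """
    rng = rng or random.Random()
    seq = [rng.choice("ACGT") for _ in range(seed_len)]
    while len(seq) < target_len:
        # Copy a random segment and insert the copy at a random position.
        seg_len = rng.randint(1, min(max_segment, len(seq)))
        start = rng.randrange(len(seq) - seg_len + 1)
        segment = seq[start:start + seg_len]
        insert_at = rng.randrange(len(seq) + 1)
        seq[insert_at:insert_at] = segment
        # Apply point mutations, each replacing a base with a different one.
        for _ in range(mutations_per_dup):
            pos = rng.randrange(len(seq))
            seq[pos] = rng.choice([b for b in "ACGT" if b != seq[pos]])
    return "".join(seq[:target_len])


grown = grow(rng=random.Random(1))
rng = random.Random(2)
matched_random = "".join(rng.choice("ACGT") for _ in range(len(grown)))
print("grown :", divergence(grown, 2))
print("random:", divergence(matched_random, 2))
```

In this toy setting the grown sequence tends to retain the large relative fluctuations of its short seed, so its divergence is expected to stay well above that of the length-matched random sequence. Raising `mutations_per_dup` should pull the two values closer together.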