Mathematical models of the generation of genetic texts appeared simultaneously with the first DNA sequencing. They are used to establish functional and evolutionary relations between genetic texts, to predict the number and distribution of specific sites in a sequence, and to identify "meaningful" words. The present paper deals with two problems: 1) The significance of deviations from the mean statistical characteristics in a genetic text. Anyone who has undertaken the statistical analysis of sequenced DNA is familiar with the question: which deviations from the expected frequencies of occurrence of particular words testify to the "biological" significance of those words? We propose a formula for the variance of the number of occurrences of a word in the text, with allowance for word overlaps, which makes it possible to assess the significance of deviations from the expected statistical characteristics. 2) A new method for predicting the frequencies of occurrence of particular words in a genetic text using the statistical characteristics of "spaced" L-grams. The method can be used for predicting the number of restriction sites in human DNA and in planning experiments on the physical mapping and sequencing of the human genome.
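To illustrate why word overlaps matter for the variance in the first problem, the following minimal sketch computes the mean and variance of the number of occurrences of a word in a random text. It assumes the simplest possible model, in which letters are independent and identically distributed; this is the textbook result for that model and is not necessarily the formula proposed in the paper. The function names (`word_prob`, `count_mean_var`, `z_score`) and the uniform base frequencies are hypothetical and chosen only for the example.

```python
def word_prob(word, freqs):
    """Probability of a word when letters are drawn independently."""
    p = 1.0
    for ch in word:
        p *= freqs[ch]
    return p


def count_mean_var(word, n, freqs):
    """Mean and variance of the occurrence count of `word` in an
    i.i.d. random text of length n, counting overlapping occurrences."""
    m = len(word)
    positions = n - m + 1          # number of possible start positions
    if positions <= 0:
        return 0.0, 0.0
    p = word_prob(word, freqs)
    mean = positions * p
    var = positions * p * (1.0 - p)
    # Start positions closer than m are correlated: add covariance terms.
    for d in range(1, m):
        if positions - d <= 0:
            break
        # The word can overlap itself at shift d only if its suffix of
        # length m - d equals its prefix of length m - d.
        overlaps = word[d:] == word[:m - d]
        # Joint probability of occurrences at positions i and i + d.
        joint = p * word_prob(word[m - d:], freqs) if overlaps else 0.0
        var += 2.0 * (positions - d) * (joint - p * p)
    return mean, var


def z_score(observed, word, n, freqs):
    """Deviation of an observed count from its expectation, in standard
    deviations (assumes the variance is positive)."""
    mean, var = count_mean_var(word, n, freqs)
    return (observed - mean) / var ** 0.5


# Assumed uniform base composition, for illustration only.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(count_mean_var("AA", 1000, uniform))    # self-overlapping word
print(count_mean_var("AC", 1000, uniform))    # non-overlapping word
```

Under this model, the two words have the same expected count, but the self-overlapping word `AA` has the larger variance, so the same observed deviation is less significant for it than for `AC`.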
Words are irregularly distributed in genetic texts. The analysis of this irregularity leads to the notion of stationary and non-stationary words. The polyW and polyS tracts are shown to be the most non-stationary words in genetic texts (here W = {A, T} and S = {G, C}; a polyW tract is a run of A and T nucleotides, and a polyS tract is a run of G and C nucleotides). The distribution of stationary words suggests a method for partitioning DNA into zones. The zones obtained in the case of the phage are interpreted in the light of the Dowe hypothesis of the modular structure of bacteriophage genomes.
Non-coding RNAs (ncRNAs) participate in various biological processes, including regulating transcription and sustaining genome 3D organization. Here, we present a method termed Red-C that exploits proximity ligation to identify contacts with the genome for all RNA molecules present in the nucleus. Using Red-C, we uncovered the RNA-DNA interactome of human K562 cells and identified hundreds of ncRNAs enriched in active or repressed chromatin, including previously undescribed RNAs. We found two microRNAs (MIR3648 and MIR3687, transcribed from the rRNA locus) that are associated with inactive chromatin genome-wide. These miRNAs favor bulk heterochromatin over Polycomb-repressed chromatin and interact preferentially with late-replicating genomic regions. Analysis of the RNA-DNA interactome also allowed us to trace the kinetics of messenger RNA production. Our data support the model of cotranscriptional intron splicing, but not the hypothesis of the circularization of actively transcribed genes.