The paper presents a novel approach to studying nucleotide sequence structure, applied to the analysis of chloroplast genome DNA sequences. A specific distribution pattern of the frequencies of consecutive nucleotide triplets was identified in the chloroplast genome DNA sequence; it showed a non-degenerate pattern with seven clusters.
Keywords: chloroplast genome, complexity, frequency dictionary, order, phase, triplet.
DOI: 10.17516/1997-1389
The identification of, and search for, structures in DNA sequences is a main objective of mathematical bioinformatics, biophysics and related scientific fields, including computer programming and information theory. Structures observed within a sequence reveal an order and make it easier to understand the functional roles of a sequence or its fragments. A new function (or a connection between function and structure, or taxonomy) might be discovered through a search for new patterns in the symbol sequences corresponding to DNA molecules.
It is commonly accepted that nucleotide sequences are rather inhomogeneous in terms of structuredness, which is demonstrated in this paper. In particular, any genome sequence roughly comprises two types of subsequences: coding and non-coding ones. These subsequences usually do not overlap, while their concatenation yields the entire genome sequence.
[...] (Krutovsky et al., 2014; Bondar et al., 2015; Sadovsky et al., 2015). This sequence consisted of 122 561 symbols (letters) from the four-letter alphabet. Neither other symbols nor blank spaces are supposed to be found in a sequence; a sequence under consideration is also supposed to be coherent (i.e., consisting of a single piece).
Materials and Methods
Concept
First, we partitioned the symbol sequences (that is, the chloroplast genomes) into a set of overlapping fragments 303 symbols (nucleotides) long, starting from the first nucleotide of the sequence and shifting the window along the chloroplast genome sequence with a step of 10 nucleotides. Second, a special frequency dictionary was developed for each fragment in this series. Third, the ensemble of the dictionaries (a set of points in the 63-dimensional Euclidean space) was clustered using the K-means technique (Fukunaga, 1990; Mirkes et al., 2013). Fourth, the distribution of those fragments over an elastic map was studied (Zinovyev, 2009, 2010; Gorban et al., 2008). Finally, we studied how the fragments belonging to the different classes obtained through K-means and the elastic map correlate with the functionally charged regions of the genome.
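The first three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the window length (303) and step (10) come from the text, while counting non-overlapping triplets inside each window, dropping the last triplet to obtain 63 coordinates, and the plain Lloyd-style `kmeans` helper are assumptions made here for the sake of a runnable example. All function names are hypothetical.

```python
from itertools import product
import random

WINDOW = 303  # fragment length, from the text
STEP = 10     # window shift, from the text
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]  # all 64 triplets


def fragments(seq, window=WINDOW, step=STEP):
    """Yield (start, fragment) for every full-length overlapping window."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


def triplet_frequencies(fragment):
    """Frequency dictionary of a fragment.

    Assumption: triplets are read without overlap (step 3), so a
    303-nucleotide fragment contributes 101 triplets.
    """
    counts = dict.fromkeys(TRIPLETS, 0)
    total = 0
    for i in range(0, len(fragment) - 2, 3):
        triplet = fragment[i:i + 3]
        if triplet in counts:  # ignore triplets with symbols outside ACGT
            counts[triplet] += 1
            total += 1
    return {t: (c / total if total else 0.0) for t, c in counts.items()}


def to_point(freqs):
    """Map a dictionary to 63 coordinates.

    Frequencies sum to 1, so one coordinate is redundant; dropping the
    last triplet is an arbitrary choice made for this sketch.
    """
    return [freqs[t] for t in TRIPLETS[:-1]]


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd iterations with random initial centers (illustrative)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each non-empty center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    toy = "ACG" * 200  # toy sequence, not real genome data
    pts = [to_point(triplet_frequencies(f)) for _, f in fragments(toy)]
    print(len(pts), len(pts[0]), sorted(set(kmeans(pts, 3))))
```

In practice the hand-written `kmeans` would likely be replaced by a library implementation; it is kept here only so that the sketch has no dependencies outside the standard library.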
Sequence data
The chloroplast genome sequences were
Lattice and Dictionary
In earlier studies (Bugaenko et al., 1996, 1997, 1998; Hu and Wang, 2001), it was demonstrated tha...