We describe several protein sequence statistics designed to evaluate distinctive attributes of residue content and arrangement in primary structure. Considered are global compositional biases, local clustering of different residue types (e.g., charged residues, hydrophobic residues, Ser/Thr), long runs of charged or uncharged residues, periodic patterns, counts and distribution of homooligopeptides, and unusual spacings between particular residue types. The computer program SAPS (statistical analysis of protein sequences) calculates all the statistics for any individual protein sequence input and is available for the UNIX environment through electronic mail on request to V.B. (volker@gnomic.stanford.edu).

Newly derived protein sequences are commonly subjected to standard sequence analysis involving identification of similarities to other proteins in databases, prediction of secondary structure, hydropathy plots, and mapping of potential glycosylation and phosphorylation sites and other motifs (see, e.g., refs.
Determination of first- and second-order Markov chain homogeneity of sets of nuclear eukaryotic DNA sequences, both coding and noncoding, finds similarities imperceptible to the standard Needleman-Wunsch base-matching or dot-matrix algorithms. These measures of the similarities of the distributions of adjacent pairs or triplets are in agreement with accepted evolutionary-tree topologies. Hierarchical clustering of the distributions of doublets of 30 miscellaneous coding sequences gives clusters in reasonable agreement with accepted biological classifications. In addition to similarity by homology, similarity is also observed among disparate genes in the same organism; for example, all three disparate yeast genes (two enzymes and actin) form a well-distinguished cluster.

Sixty-four miscellaneous eukaryotic DNA sequences, half coding and half noncoding, have recently been examined as expressions of first-, second-, or third-order Markov chains (1). Standard statistical tests (2, 3) found that 61/64 required at least a first-order Markov chain (that is, not zeroth order) for their expression, 37/64 required at least second order, and a few required at least third order. For 64/64 sequences, the one-step second-order transition count matrix (counts of consecutive triples of bases in sites 1, 2, and 3) predicted the occupant of site 4 from the occupants of sites 1, 2, and 3 better than did the assumption of random occupancy at site 4, and similarly for 56/64 sequences at site 5. The statistical papers (2, 3) also provide tests of the homogeneity of several realizations of Markov chains of a given order, that is, tests of the hypothesis that the several observed sequences are samples generated by the same stochastic Markov chain.
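The transition count matrices referred to above can be illustrated with a short Python sketch. It is not the code used in the cited work: the function names `transition_counts` and `transition_probabilities` are hypothetical, and the choice to skip any window containing a character other than A, C, G, or T is an assumption made here for simplicity.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def transition_counts(seq, order):
    """Count (context, next base) pairs, where the context is `order` bases long.

    order=1 counts adjacent pairs (doublets); order=2 counts consecutive triples.
    Windows containing anything other than A, C, G, T are skipped (an assumption).
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - order):
        window = seq[i:i + order + 1]
        if all(base in BASES for base in window):
            counts[(window[:order], window[order])] += 1
    return counts

def transition_probabilities(counts, order):
    """Normalize counts row by row into estimated transition probabilities."""
    probs = {}
    for context in map("".join, product(BASES, repeat=order)):
        total = sum(counts[(context, base)] for base in BASES)
        if total:  # contexts never observed are left out
            probs[context] = {base: counts[(context, base)] / total for base in BASES}
    return probs

if __name__ == "__main__":
    doublets = transition_counts("ACGTACGT", order=1)
    print(transition_probabilities(doublets, order=1))
```

Count tables of this kind, one per sequence, would be the input to a homogeneity test; the test statistic itself is not reproduced in this sketch.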
This paper presents the results of such tests of homogeneity of Markov chain representations of orders 1 and 2 for selected sets of sequences and compares the results with the Needleman-Wunsch base-matching (4) and dot-matrix (5) assessments of sequence similarity.

METHOD

The information theoretic test (3) of the homogeneity of n samples of transition matrices of order 1, that is, of distributions of pairs of contiguous bases, is as follows: n 4 4
Genomic homogeneity is investigated for a broad base of DNA sequences in terms of dinucleotide relative abundance distances (abbreviated δ-distances) and of oligonucleotide compositional extremes. It is shown that δ-distances between different genomic sequences in the same species are low, only about 2 or 3 times the distance found in random DNA, and are generally smaller than the between-species δ-distances.

There are many expressions of genomic heterogeneity: (i) local and global variations in C+G content; (ii) distinctive direct and inverted repeats, such as REP sequences in Escherichia coli (1), telomeric repeats, satellite DNA, and multigene families; (iii) transposable elements, such as IS in E. coli, Ty in yeast, Alu and LINES in human (2); (iv) methylation influences (3); (v) oligonucleotide relative abundance extremes, such as underrepresentation of the dinucleotide TpA (4, 5) and of the tetranucleotide CTAG in many eubacteria (5, 6); (vi) a myriad of control elements (e.g., promoter, enhancer, and termination signals), origins of replication (e.g., autonomously replicating sequences), and repair recognition sites (e.g., Dam (Tables 3 and 7), and 21 bacterial DNA sets mostly at least 100 kb long (Table 6). Individual species sequences were combined into aggregations of about 100 kb. A sample sequence is designated long when composed from contigs each of length ≥10 kb and designated short when composed from contigs of <10 kb. The current human genome collection includes 21 contigs of length 30-180 kb. These long contigs were joined, creating 10 long samples of lengths 100-125 kb.

Dinucleotide Relative Abundance Values. A common assessment of dinucleotide bias is through the odds ratio $\rho_{XY} = f_{XY}/(f_X f_Y)$, where $f_X$ denotes the frequency of the nucleotide X and $f_{XY}$ is the frequency of the dinucleotide XY.
The formula for $\rho_{XY}$ is modified to accommodate double-stranded DNA by calculating the odds ratio for the given DNA sequence combined with its inverted complement sequence. This changes $f_A$, the frequency of the mononucleotide A, to $f^*_A = f^*_T = (f_A + f_T)/2$, and similarly $f^*_C = f^*_G = (f_C + f_G)/2$. Also, $f^*_{GT} = (f_{GT} + f_{AC})/2$, etc. The (symmetrized) dinucleotide odds ratio measure for double-stranded DNA is $\rho^*_{AC} = \rho^*_{GT} = f^*_{GT}/(f^*_G f^*_T)$ and similarly for all dinucleotides. The deviation of $\rho^*_{GT}$ from 1 can be construed as an assessment of dinucleotide bias of GT/AC (7). A corresponding trinucleotide measure is $\gamma^*_{XYZ} = f^*_{XYZ} f^*_X f^*_Y f^*_Z / (f^*_{XY} f^*_{YZ} f^*_{XNZ})$, where N is any nucleotide. Higher-order measures for longer oligonucleotides are also available (8). Dinucleotide relative abundances effectively assess contrasts between observed dinucleotide frequencies and those expected from the component mononucleotide frequencies. Similarly, trinucleotide relative abundances appropriately discount the influences of mono- and dinucleotide frequencies, and correspondingly higher-order oligonucleotide relative abundances factor out all lower-order oligonucleotide frequencies.

Dinucleotide Relative Abundance Distance. We use a measure of dinucleotide distance between tw...
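As an illustration of the symmetrized dinucleotide odds ratio defined above, the following Python sketch pools counts from a sequence and its inverted complement. The function names are hypothetical, the input is assumed to contain only A, C, G, and T, and the two strands are counted separately so that no artificial dinucleotide is created at a junction between them. It is a minimal sketch, not the authors' implementation.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the inverted complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def symmetrized_odds_ratios(seq):
    """Dinucleotide odds ratios computed from a sequence pooled with its inverted complement.

    Pooling both strands of equal length is equivalent to averaging the
    frequencies, e.g. f*_A = (f_A + f_T) / 2 and f*_GT = (f_GT + f_AC) / 2.
    """
    seq = seq.upper()
    mono = Counter()
    di = Counter()
    for strand in (seq, reverse_complement(seq)):
        mono.update(strand)
        di.update(strand[i:i + 2] for i in range(len(strand) - 1))
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    rho = {}
    for x in "ACGT":
        for y in "ACGT":
            fx = mono[x] / n_mono
            fy = mono[y] / n_mono
            if fx and fy:  # ratio is undefined when a base is absent
                rho[x + y] = (di[x + y] / n_di) / (fx * fy)
    return rho

if __name__ == "__main__":
    for dinucleotide, value in sorted(symmetrized_odds_ratios("ACGTTGCAAGGCTA").items()):
        print(dinucleotide, round(value, 3))
```

Because of the pooling, complementary dinucleotides such as GT and AC receive the same value, which matches the identity between the AC and GT ratios stated in the text.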
The random packing of geometric objects in one, two, or three dimensions may afford useful insights into the structure of crystals, liquids, and adsorbates on crystals, and, in higher dimensions, into problems of pattern recognition. Random packing has accordingly received increasing attention in recent years. Two principal packing procedures have been formulated, and each gives rise to a different packing ratio. In one case, all possible configurations of a sphere-packed volume are assumed to be equally likely. In the other and most widely reported case, spheres are added to the volume sequentially at random until it is packed. This is the situation we study in this paper. Most of the work to date has been limited to the theoretical study of the one-dimensional lattice or to continuous cases, particularly in the limit for long lines. The higher-dimensional cases have resisted theoretical attack but have been studied by computer simulation by Palasti [12] and Solomon [14] and by physical simulation by Bernal and Scott (see [14]).
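For the continuous one-dimensional case, random sequential addition can be sketched in a few lines of Python. The function name `pack_line` is hypothetical. The sketch relies on a property specific to one dimension: once an interval is placed, the gaps on either side fill independently, and the next interval to land in a gap is uniformly distributed over the admissible positions in that gap, so each gap can be filled recursively without simulating rejected attempts. This shortcut is an assumption of the sketch and does not carry over to the higher-dimensional cases discussed above.

```python
import random

def pack_line(length, rng):
    """Place non-overlapping unit intervals on [0, length] until no gap of size >= 1 remains.

    Returns the sorted left endpoints of the placed intervals.
    """
    placed = []
    gaps = [(0.0, float(length))]  # open gaps still to be examined
    while gaps:
        left, right = gaps.pop()
        if right - left < 1.0:
            continue  # too small to hold another unit interval
        start = rng.uniform(left, right - 1.0)  # uniform over admissible positions
        placed.append(start)
        gaps.append((left, start))          # gap to the left of the new interval
        gaps.append((start + 1.0, right))   # gap to the right of the new interval
    return sorted(placed)

if __name__ == "__main__":
    rng = random.Random(0)
    line_length = 1000
    intervals = pack_line(line_length, rng)
    print("packing ratio:", len(intervals) / line_length)
```

The printed packing ratio is the fraction of the line covered at saturation; averaging it over many seeds and longer lines would give an estimate of the limiting value for long lines.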