Statistical distance between texts and filtration methods in sequence comparison

Pevzner, Pavel A.

doi:10.1093/bioinformatics/8.2.121

Cited by 16 publications

(9 citation statements)

References 21 publications

“…If the imprint of the global signature is locally pervasive, down to the scale of the single gene or coding sequence, large deviations on that scale could highlight segments introduced by recent horizontal transfer from another species [13]. So-called filtration methods, based on dissimilarity measures computed from dinucleotide counts, have been employed for the alignment-free computation of evolutionary distances between homologous sequences [14,15]. The "transition matrix method" was a similar technique involving raw counts of amino acid pairs in protein primary sequences [16].…”

Section: Introduction (mentioning)

confidence: 99%

Pervasive properties of the genomic signature

Jernigan

Baran

2002

BMC Genomics


Background: The dinucleotide relative abundance profile can be regarded as a genomic signature because, despite diversity between species, it varies little between 50 kilobase or longer windows on a given genome. Both the causes and the functional significance of this phenomenon could be illuminated by determining if it persists on smaller scales. The profile is computed from the base step "odds ratios" that compare dinucleotide frequencies to those expected under the assumption of stochastic equilibrium (thorough shuffling). Analysis is carried out on 22 sequences, representing 19 species and comprised of about 53 million bases all together, to assess stability of the signature in windows ranging in size from 50 kilobases down to 125 bases.
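
The odds ratios described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the usual form of the ratio, the observed dinucleotide frequency divided by the product of the two mononucleotide frequencies, and it skips any strand symmetrization. The function names and the mean-absolute-difference distance are hypothetical choices made for the example.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_odds_ratios(seq):
    """Return {dinucleotide: f_xy / (f_x * f_y)} for a DNA string.

    Assumption: frequencies come from one strand only, and any step that
    contains a symbol other than A, C, G, T is ignored.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    if n_di == 0:
        raise ValueError("sequence contains no unambiguous dinucleotide")

    ratios = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        observed = di[x + y] / n_di
        # A base that never occurs makes the ratio undefined.
        ratios[x + y] = observed / expected if expected > 0 else float("nan")
    return ratios


def signature_distance(seq_a, seq_b):
    """Hypothetical dissimilarity: mean absolute difference of the 16 ratios.

    Returns NaN if either sequence lacks one of the four bases.
    """
    ra = dinucleotide_odds_ratios(seq_a)
    rb = dinucleotide_odds_ratios(seq_b)
    return sum(abs(ra[k] - rb[k]) for k in ra) / len(ra)
```

Applying `dinucleotide_odds_ratios` to successive windows of one genome and comparing the profiles with `signature_distance` is one way to probe how stable the signature stays as the window shrinks.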

“…Current methods (reviewed in Pevzner, 1992) typically require two arbitrary assumptions to be made for each similarity search: one about the length of the longest common word that is to be considered and the other about the threshold of similarity for significant matches. The method proposed in this paper removes the need for any restrictions on word length while keeping the computation time linear, and it also provides a bound on significance, thus removing need for any arbitrary thresholds.…”

Section: Results (mentioning)

confidence: 99%

Untitled

Milosavljević

1995

Machine Learning


Abstract. Algorithmic mutual information is a central concept in algorithmic information theory and may be measured as the difference between independent and joint minimal encoding lengths of objects; it is also a central concept in Chaitin's fascinating mathematical definition of life. We explore applicability of algorithmic mutual information as a tool for discovering dependencies in biology. In order to determine significance of discovered dependencies, we extend the newly proposed algorithmic significance method. The main theorem of the extended method states that d bits of algorithmic mutual information imply dependency at the significance level $2^{-d+O(1)}$. We apply a heuristic version of the method to one of the main problems in DNA and protein sequence comparisons: the problem of deciding whether observed similarity between sequences should be explained by their relatedness or by the mere presence of some shared internal structure, e.g., shared internal repetitive patterns. We take advantage of the fact that mutual information factors out sequence similarity that is due to shared internal structure and thus enables discovery of truly related sequences. In addition to providing a general framework for sequence comparisons, we also propose an efficient way to compare sequences based on their subword composition that does not require any a priori assumptions about k-tuple length.
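
The idea of measuring mutual information as the gap between independent and joint encoding lengths can be illustrated with a rough sketch. Minimal encoding lengths are not computable, so the example below swaps in zlib's compressed size as a crude stand-in; this is an assumption made for illustration and is not the heuristic used in the paper. The function names are hypothetical, and no p-value is derived because the additive constant in the significance bound is unspecified.

```python
import random
import zlib


def encoded_length(data: bytes) -> int:
    """Approximate encoding length in bits, using zlib as a stand-in coder."""
    return 8 * len(zlib.compress(data, 9))


def approx_mutual_information(x: str, y: str) -> int:
    """Independent minus joint encoding length, in bits.

    Assumption: compressing the concatenation x + y approximates a joint
    encoding. The estimate is noisy and can be slightly negative.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    return encoded_length(bx) + encoded_length(by) - encoded_length(bx + by)


# Synthetic data for illustration only: two independent random DNA strings.
rng = random.Random(0)
a = "".join(rng.choices("ACGT", k=2000))
unrelated = "".join(rng.choices("ACGT", k=2000))

related_bits = approx_mutual_information(a, a)            # identical pair
unrelated_bits = approx_mutual_information(a, unrelated)  # independent pair
```

A sequence paired with itself should score far more bits than a pair of independent random sequences, because the joint encoding can reuse everything it learned from the first copy.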


“…Since then, it has been applied in phylogenetic reconstruction [8][9][10][11], identification of homologous proteins [4], genome annotation [12], classification of metagenomic sequences [13], and identification of regulatory sequences [14]. Also, it has been shown as an efficient technique for sequence filtering [15].…”

Section: /32 (mentioning)

confidence: 99%