2002
DOI: 10.1073/pnas.202468099

Distributional regimes for the number of k-word matches between two random sequences

Abstract: When comparing two sequences, a natural approach is to count the number of k-letter words the two sequences have in common. No positional information is used in the count, but it has the virtue that the comparison time is linear with sequence length. For this reason this statistic D2 and certain transformations of D2 are used for EST sequence database searches. In this paper we begin the rigorous study of the statistical distribution of D2. Using an independence model of DNA sequences, we derive limiting distr…
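The word count described in the abstract can be illustrated with a minimal sketch. It assumes D2 counts every pair of positions (one in each sequence) at which the same k-letter word starts, which equals the sum over words of the product of the two per-sequence word counts. The function names `kmer_counts` and `d2` are illustrative and do not come from the paper.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of k-word matches between seq_a and seq_b (assumed D2 definition).

    Each word w contributes count_a(w) * count_b(w), i.e. one match per
    pair of start positions (i in seq_a, j in seq_b) sharing the same word.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Iterate over the smaller table; Counter returns 0 for absent words.
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    return sum(c * counts_b[w] for w, c in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "TACGA", 3))
```

Each sequence is scanned once and the word tables are hashed, so the running time grows linearly with the combined sequence length for fixed k, consistent with the linear comparison time the abstract mentions.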

Cited by 107 publications (140 citation statements)
References 25 publications
“…Previous studies of the D2 statistic used Kolmogorov-Smirnov tests [3] to compare the empirical distribution of D2 with its theoretical asymptotic distributions (normal or compound-Poisson) [7,2]. These studies, however, have been in error for the following reason.…”
Section: Comparison Between Empirical and Hypothesised Distributions (mentioning)
confidence: 99%
“…For pairs of Bernoulli texts with non-uniform letter distributions, the limiting distribution is compound Poisson in the regime $k > 2\log_b n + \text{const}$ [7], and normal in the regime $k < \tfrac{1}{2}\log_b n + \text{const}$ [2]. Here $b = p_2^{-1}$.…”
Section: Introduction (mentioning)
confidence: 99%
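To make the two regimes in the statement above concrete, here is a rough sketch that places a word length k relative to the two thresholds. It rests on two assumptions not stated in the quoted text: that p_2 is the sum of squared letter probabilities, and that the unspecified additive constants can be ignored (they are set to zero here). The names `log_base_b` and `d2_regime` are hypothetical.

```python
import math


def log_base_b(n, letter_probs):
    """Return log_b(n) with b = 1 / p_2.

    Assumption: p_2 is the sum of squared letter probabilities.
    """
    p2 = sum(p * p for p in letter_probs)
    if not 0.0 < p2 < 1.0:
        raise ValueError("need at least two letters with positive probability")
    b = 1.0 / p2
    return math.log(n) / math.log(b)


def d2_regime(k, n, letter_probs):
    """Classify k against the quoted thresholds, with the constants dropped."""
    log_b_n = log_base_b(n, letter_probs)
    if k > 2.0 * log_b_n:
        return "compound Poisson"
    if k < 0.5 * log_b_n:
        return "normal"
    return "not covered by the quoted regimes"


if __name__ == "__main__":
    probs = [0.4, 0.3, 0.2, 0.1]  # illustrative non-uniform letter distribution
    for k in (2, 8, 30):
        print(k, d2_regime(k, 10_000, probs))
```

Because the constants are dropped, word lengths close to either threshold may be misclassified; the sketch only indicates the rough order of magnitude of k at which each limit is expected to apply.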
“…Repeat structure in large genomes has been analyzed without first constructing consensus repeat family sequences [11,12], including the use of oligonucleotide (hereafter "oligo") or l-mer similarity, rather than sequence similarity [13,14], and analytical counting methods, such as RAP [15] and the method of Healy and colleagues [16]. There has been some statistical evaluation of oligo-based repeat region identification using these methods [15,16], but no comprehensive genomic annotation approaches have been developed for oligo-based repeat analysis.…”
Section: Introduction (mentioning)
confidence: 99%
“…A.2 of Appendix A.6) additional computation time is required for the identification and classification of each maxmer and is proportional to the number of occurrences of this maxmer. In the worst case the number of occurrences of a given maxmer may be approximated by the integer value k at the maximum of a Poisson distribution, $p_\lambda(k) = \lambda^k e^{-\lambda}/k!$, where the interval is taken to be the sequence size n [45,46]; however, since this quantity can only be determined empirically, the overall computation time order is estimated based on real computations. The full run time of the length distribution computation is plotted against the genome sequence size in Fig.…”
Section: Computational Complexity (mentioning)
confidence: 99%
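The "integer value k at the maximum of a Poisson distribution" in the statement above is the mode of the distribution. A small sketch, assuming a known rate λ (the quoted text notes that in practice it can only be determined empirically), evaluates the probability mass function and its mode. The names `poisson_pmf` and `poisson_mode` are illustrative.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability p_lam(k) = lam**k * exp(-lam) / k!.

    Computed in log space so that large k does not overflow.
    """
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def poisson_mode(lam):
    """Integer k that maximises p_lam(k).

    For non-integer lam this is floor(lam); for integer lam both
    lam - 1 and lam attain the maximum, and lam is returned.
    """
    return math.floor(lam)


if __name__ == "__main__":
    lam = 3.5  # hypothetical rate, chosen only for illustration
    print(poisson_mode(lam), poisson_pmf(poisson_mode(lam), lam))
```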