Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences

Wu, Ta Jen; Huang, Ying-Hsueh; Li, Lung-An

doi:10.1093/bioinformatics/bti658

Cited by 54 publications

(43 citation statements)

References 23 publications

“…A pairwise complete genome alignment routinely takes 2 h; thus, a 20-s MUMi calculation to preselect appropriate genomes should prove to be a convenient tool. A similar preselection strategy is used in large-scale BLAST alignments of proteins or genes and is based on an estimation of word dissimilarity (32,33). MUMi will also be valuable for fine tuning the parameters of software used for such alignments, e.g., MGA (17), MAUVE (6), or M-GCAT (30).…”

Section: Discussion (mentioning)

confidence: 99%

A Genomic Distance Based on MUM Indicates Discontinuity between Most Bacterial Species and Genera

Deloger, Karoui, Petit

2009

J Bacteriol

145

120

The fundamental unit of biological diversity is the species. However, a remarkable extent of intraspecies diversity in bacteria was discovered by genome sequencing, and it reveals the need to develop clear criteria to group strains within a species. Two main types of analyses used to quantify intraspecies variation at the genome level are the average nucleotide identity (ANI), which detects the DNA conservation of the core genome, and the DNA content, which calculates the proportion of DNA shared by two genomes. Both estimates are based on BLAST alignments for the definition of DNA sequences common to the genome pair. Interestingly, however, results using these methods on intraspecies pairs are not well correlated. This prompted us to develop a genomic-distance index taking into account both criteria of diversity, which are based on DNA maximal unique matches (MUM) shared by two genomes. The values, called MUMi, for MUM index, correlate better with the ANI than with the DNA content. Moreover, the MUMi groups strains in a way that is congruent with routinely used multilocus sequence-typing trees, as well as with ANI-based trees. We used the MUMi to determine the relatedness of all available genome pairs at the species and genus levels. Our analysis reveals a certain consistency in the current notion of bacterial species, in that the bulk of intraspecies and intragenus values are clearly separable. It also confirms that some species are much more diverse than most. As the MUMi is fast to calculate, it offers the possibility of measuring genome distances on the whole database of available genomes.

“…Note, for comparison of large chromosomes we have used a simplified 2-letter alphabet. The block method is similar to that described by Wu et al (17). When sequences a and b are compared, each sequence is divided into m length blocks.…”

Section: Removal Of High Frequency and Low Complexity Features (mentioning)

confidence: 99%

Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions

Sims, Jun, et al.

2009

Proc. Natl. Acad. Sci. U.S.A.

377

386

For comparison of whole-genome (genic + nongenic) sequences, multiple sequence alignment of a few selected genes is not appropriate. One approach is to use an alignment-free method in which feature (or l-mer) frequency profiles (FFP) of whole genomes are used for comparison, a variation of a text or book comparison method, using word frequency profiles. In this approach it is critical to identify the optimal resolution range of l-mers for the given set of genomes compared. The optimum FFP method is applicable for comparing whole genomes or large genomic regions even when there are no common genes with high homology. We outline the method in 3 stages: (i) We first show how the optimal resolution range can be determined with English books which have been transformed into long character strings by removing all punctuation and spaces. (ii) Next, we test the robustness of the optimized FFP method at the nucleotide level, using a mutation model with a wide range of base substitutions and rearrangements. (iii) Finally, to illustrate the utility of the method, phylogenies are reconstructed from concatenated mammalian intronic genomes; the FFP derived intronic genome topologies for each l within the optimal range are all very similar. The topology agrees with the established mammalian phylogeny revealing that intron regions contain a similar level of phylogenic signal as do coding regions.

Keywords: mammalian genome phylogeny | whole-genome comparison | whole-genome phylogeny | whole-intron phylogeny

The comparison of 2 closely related genomes at the base-by-base nucleotide sequence level is accomplished by sequence alignment. However, because species diverge extensively over time, insertions/deletions and genomic rearrangements make straightforward sequence alignment unreliable or impossible. This difficulty is typically overcome by 1 of 2 methods.

The first involves extracting a common subset of genes (coding sequences) shared by all of the species compared, then building a multiple sequence alignment (MSA) for each gene, and finally concatenating each alignment into a super MSA (1). The MSA and an appropriate base-substitution model are used to calculate similarity scores. The second method is best described as gene profiling, where the occurrence of each gene in a dictionary of genes is counted, forming a gene presence/absence profile. The relative frequency difference between genomes from their gene profiles is used to derive a similarity score (2). Both methods rely on the correct definition and selection of common genes to be compared, and significant homology among aligned gene sequences.

If, however, the genomes do not share an alignable set of common genes, the alignment-free method is the only option of choice at present. Also, these methods of comparison strictly focus on comparing the coding (coding for protein, and functional RNA) portions of genomes, which can amount to as little as 1% of the genomic sequence in humans (3). As for the noncoding sequence of the genome (the other 99%), much of its function is unknown, ...
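The l-mer frequency profile described in the abstract above can be illustrated with a minimal Python sketch. The function names `ffp` and `js_divergence` are hypothetical, and the use of the Jensen-Shannon divergence to compare two profiles is an assumption made for illustration; the excerpt does not state which distance the authors apply, nor how the optimal range of l is chosen.

```python
from collections import Counter
from math import log2


def ffp(seq: str, l: int) -> dict:
    """Relative frequency of every overlapping l-mer in seq (hypothetical helper)."""
    if l <= 0 or l > len(seq):
        raise ValueError("l must be between 1 and len(seq)")
    counts = Counter(seq[i:i + l] for i in range(len(seq) - l + 1))
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def js_divergence(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence (base 2) between two profiles.

    Assumed distance for this sketch: 0 for identical profiles,
    1 for profiles that share no l-mer.
    """
    div = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture probability of this l-mer
        if pw > 0:
            div += 0.5 * pw * log2(pw / mw)
        if qw > 0:
            div += 0.5 * qw * log2(qw / mw)
    return div
```

Calling `js_divergence(ffp(a, l), ffp(b, l))` for several values of l is one way to see how the chosen resolution changes the apparent distance between two sequences.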

“…Furthermore, let γ(s, q) be the required CPU In this section, the D2 value between the encoded library sequence s with the length 50 ≤ N ≤ 50,000 and query with the length 300 and word size k = 3 are computed to examine the efficiency of the proposed approach by capturing total execution time ∆. One common approach to find the dissimilarity between k-tuples in D2 statistics is to take the minimum of all window distances for each pair (W(L), W(Q)), where W(L) and W(Q) are the k-tuples in library sequence and query sequence respectively [19]. In Fig.…”

Section: A. D2 String Comparison vs. Proposed Approach; B. Efficiency (mentioning)

confidence: 99%
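
For orientation, the basic D2 statistic counts k-word matches between two sequences, which equals the sum over all words of the product of their counts in each sequence. The sketch below shows only that basic count form, with the hypothetical names `kmer_counts` and `d2`; it does not implement the window-minimum variant mentioned in the statement above.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-word in seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(library: str, query: str, k: int = 3) -> int:
    """Basic D2: number of matching k-word pairs between the two sequences."""
    lib_counts = kmer_counts(library, k)
    query_counts = kmer_counts(query, k)
    # A Counter returns 0 for words that never occur in the query.
    return sum(c * query_counts[word] for word, c in lib_counts.items())
```

The default `k = 3` mirrors the word size quoted in the statement; any other positive word size can be passed instead.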

Efficient Sequence Comparison Using Binary Codes

Rahimi

2015

IJFCC

Abstract: In this paper, we propose an efficient way of finding the exact distance in sequence comparison by using Huffman coding method for alphabets with uniform symbol probabilities. The approach is proposed as a refinement for word pair comparison in D2 statistics, though it can readily be generalised. Two given sequences with identical lengths are encoded to Huffman binary codes by which we are able to calculate Hamming Distance using binary operations efficiently. This method is applied on D2 statistics to compare k-tuples faster than its original version. The evaluation on empirical sequences showed that the method is faster than original D2; especially, when re-using the encoded sequences which resulted in better performance.
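
The idea of computing a Hamming distance with binary operations can be sketched as follows. For a four-letter alphabet with uniform probabilities, a Huffman code assigns every symbol a 2-bit codeword; the particular assignment in `CODE` and the function names `encode` and `hamming` are assumptions for illustration and are not taken from the paper.

```python
# Assumed 2-bit codewords; any one-to-one assignment works for this sketch.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(seq: str) -> int:
    """Pack a DNA string into one integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have identical lengths")
    if not a:
        return 0
    diff = encode(a) ^ encode(b)      # non-zero 2-bit group = mismatch
    mask = int("01" * len(a), 2)      # keeps the low bit of each group
    collapsed = (diff | (diff >> 1)) & mask
    return bin(collapsed).count("1")  # one set bit per mismatching base
```

Encoding each sequence once and reusing the integers across many comparisons matches the re-use scenario the abstract describes as the most favourable case.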

Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences

Abstract: The algorithm SK-LD, estimate beta and simulation software are implemented in MATLAB code, and are available at http://www.stat.ncku.edu.tw/tjwu
