Benchmarking of alignment-free sequence comparison methods

Zieleziński, Andrzej; Girgis, Hani Z.; Bernard, Guillaume; Leimeister, Chris-André; Tang, Kujin; Dencker, Thomas; Lau, Anna Katharina; Röhling, Sophie; Choi, Jiyoun; Waterman, Michael S.; Comin, Matteo; Kim, Sung‐Hou; Vinga, Susana; Almeida, Jonas S.; Chan, Cheong Xin; James, Benjamin; Sun, Fengzhu; Morgenstern, Burkhard; Karłowski, Wojciech M.

doi:10.1101/611137

Cited by 31 publications

(46 citation statements)

References 95 publications


“…For such data, standard two-phase methods that first compute an alignment and then compute a tree do not have acceptable accuracy, while PASTA [32], BAli-Phy [33], and other co-estimation methods are not fast. It is possible that alignment-free methods (see [34,35,36] for an entry into this topic) might provide good starting trees, but these have not been tested on ultra-large datasets (with thousands of species), and have instead mainly been focused on genome-scale analyses of tens of genomes. However, for any large dataset on which the starting trees cannot be reasonably accurately estimated quickly, blended DTM divide-and-conquer strategies may provide the best accuracy.…”

Section: Discussion (mentioning)

confidence: 99%

Unblended Disjoint Tree Merging using GTM improves species tree estimation

Smirnov

Warnow

2019

Preprint


Phylogeny estimation is an important part of much biological research, but large-scale tree estimation is infeasible using standard methods due to computational issues. Recently, an approach to large-scale phylogeny has been proposed that divides a set of species into disjoint subsets, computes trees on the subsets, and then merges the trees together using a computed matrix of pairwise distances between the species. The novel component of these approaches is the last step: Disjoint Tree Merger (DTM) methods. We present GTM (Guide Tree Merger), a polynomial time DTM method that adds edges to connect the subset trees, so as to provably minimize the topological distance to a computed guide tree. Thus, GTM performs unblended mergers, unlike the previous DTM methods. Yet, despite the potential limitation, our study shows that GTM has excellent accuracy, generally matching or improving on two previous DTMs, and is much faster than both. Thus, the GTM approach to the DTM problem is a useful new tool for large-scale phylogenomic analysis, and shows the surprising potential for unblended DTM methods. The software for GTM is available at https://github.com/vlasmirnov/GTM.


“…While variants are typically discovered with short reads by mapping them to a target reference genome, one can also directly compare common subsequences among samples (Zielezinski et al, 2019).…”

Section: Introduction (mentioning)

confidence: 99%

Finding genetic variants in plants without complete genomes

Voichek

Weigel

2019

Preprint


Structural variants and presence/absence polymorphisms are common in plant genomes, yet they are routinely overlooked in genome-wide association studies (GWAS). Here, we expand the genetic variants detected in GWAS to include major deletions, insertions, and rearrangements. We first use raw sequencing data directly to derive short sequences, k-mers, that mark a broad range of polymorphisms independently of a reference genome. We then link k-mers associated with phenotypes to specific genomic regions. Using this approach, we re-analyzed 2,000 traits measured in Arabidopsis thaliana, tomato, and maize populations. Associations identified with k-mers recapitulate those found with single-nucleotide polymorphisms (SNPs), however, with stronger statistical support. Moreover, we identified new associations with structural variants and with regions missing from reference genomes. Our results demonstrate the power of performing GWAS before linking sequence reads to specific genomic regions, which allows detection of a wider range of genetic variants responsible for phenotypic variation.

Here, we present an efficient method for k-mer-based GWAS and compare it directly to the conventional SNP-based approach on more than 2,000 phenotypes from three plant species with different genome and population characteristics: A. thaliana, maize and tomato. Most variants identified by SNPs can be detected with k-mers (and vice versa), but k-mers have stronger statistical support. For k-mer-only hits, we demonstrate how different strategies can be used to infer their genomic context, including large structural variants, sequences missing from the reference genome, and organellar variants. Lastly, we compute population structure directly from k-mers, enabling the analysis of species with poor quality or without a reference genome.

In summary, we have inverted the conventional approach of building a genome, using it to find population variants, and only then associating variants with phenotypes. In contrast, we begin by associating sequencing reads with phenotypes, and only then infer the genomic context of these sequences. We posit that this change of order is especially effective in plant species, for which defining the full population-level genetic variation based on reference genomes remains highly challenging.

[...] Schneeberger et al., 2009). While traditional GWAS methods will benefit from these technological improvements, so will k-mer based approaches, which will be able to use tags spanning larger genomic distances. Therefore, we posit that for GWAS purposes, k-mer based approaches are ideal because they minimize arbitrary choices when classifying alleles and because they capture more, almost optimal, information from raw sequencing data.
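
The first step described in this abstract, deriving k-mers from raw reads without a reference genome, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `kmers` and `presence_absence`, the in-memory dictionary of reads, and the choice to skip k-mers containing non-ACGT characters are all assumptions made for illustration. Practical tools also typically merge each k-mer with its reverse complement and count k-mers on disk, which is omitted here.

```python
from typing import Dict, Iterable, Set, Tuple

VALID = set("ACGT")  # assumption: k-mers with any other symbol are skipped


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of k-mers in one read (illustrative helper)."""
    seq = seq.upper()
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= VALID
    }


def presence_absence(
    samples: Dict[str, Iterable[str]], k: int
) -> Dict[str, Tuple[int, ...]]:
    """Map each k-mer to a 0/1 tuple, one entry per sample in input order."""
    names = list(samples)
    per_sample: Dict[str, Set[str]] = {}
    for name in names:
        found: Set[str] = set()
        for read in samples[name]:
            found |= kmers(read, k)
        per_sample[name] = found
    all_kmers: Set[str] = set().union(*per_sample.values())
    return {
        km: tuple(int(km in per_sample[n]) for n in names)
        for km in sorted(all_kmers)
    }


if __name__ == "__main__":
    # Toy reads for two hypothetical samples that differ at one position.
    table = presence_absence({"a": ["ACGTAC"], "b": ["ACGAAC"]}, k=3)
    for kmer, pattern in table.items():
        print(kmer, pattern)
```

Rows whose pattern is not constant across samples mark candidate polymorphisms; in a k-mer GWAS those patterns, rather than SNP genotypes, would be tested for association with a phenotype.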


“…• Alignment based methods: These involve either shifting or insertion of gaps in sequences for alignment of two or more sequences, which make these methods computationally intensive. • Alignment-free methods: These are computationally less intensive methods that consider the genome sequences as character strings and use distance-based methods involving frequency and distribution of bases [12][13][14]. Our focus in this paper is on alignment-free methodology, especially on using complexity measures for sequence comparisons.…”

Section: Genome Sequence Comparison (mentioning)

confidence: 99%

Automatic Identification of SARS Coronavirus using Compression-Complexity Measures

Balasubramanian

Nagaraj

2020

Preprint


Finding vaccine or specific antiviral treatment for global pandemic of virus diseases (such as the ongoing COVID-19) requires rapid analysis, annotation and evaluation of metagenomic libraries to enable a quick and efficient screening of nucleotide sequences. Traditional sequence alignment methods are not suitable and there is a need for fast alignment-free techniques for sequence analysis. Information theory and data compression algorithms provide a rich set of mathematical and computational tools to capture essential patterns in biological sequences. In 2013, our research group (Nagaraj et al., Eur. Phys. J. Special Topics 222(3-4), 2013) has proposed a novel measure known as Effort-To-Compress (ETC) based on the notion of compression-complexity to capture the information content of sequences. In this study, we propose a compression-complexity based distance measure for automatic identification of SARS coronavirus strains from a set of viruses using only short fragments of nucleotide sequences. We also demonstrate that our proposed method can correctly distinguish SARS-CoV-2 from SARS-CoV-1 viruses by analyzing very short segments of nucleotide sequences. This work could be extended further to enable medical practitioners in automatically identifying and characterizing SARS coronavirus strain in a fast and efficient fashion using short and/or incomplete segments of nucleotide sequences. Potentially, the need for sequence assembly can be circumvented.

Note: The main ideas and results of this research were first presented at the International Conference on Nonlinear Systems and Dynamics (CNSD-2013) held at Indian Institute of Technology, Indore, December 12, 2013. In this manuscript, we have extended our preliminary analysis to include SARS-CoV-2 virus as well.
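
To give a feel for how a compression-based, alignment-free distance behaves, the sketch below computes the normalized compression distance (NCD) with Python's standard `zlib` module. This is a different, widely known measure swapped in for illustration; it is not the Effort-To-Compress (ETC) measure proposed in the abstract above, and the function names `compressed_size` and `ncd` are assumptions.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size. Values near 0 suggest the sequences
    share most of their content; values near 1 suggest little shared content.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    import random

    rng = random.Random(0)  # fixed seed so the toy sequences are reproducible
    a = "".join(rng.choice("ACGT") for _ in range(2000))
    b = "".join(rng.choice("ACGT") for _ in range(2000))
    print("ncd(a, a) =", ncd(a, a))
    print("ncd(a, b) =", ncd(a, b))
```

Because real compressors are imperfect, NCD is only an approximation: the distance of a sequence to itself is small but not exactly zero, and values slightly above 1 can occur. Very short inputs are dominated by compressor overhead, so results for such fragments should be interpreted with caution.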
