Fast alignment-free sequence comparison using spaced-word frequencies

Leimeister, Chris-André; Boden, Marcus; Horwege, Sebastian; Lindner, Sebastian; Morgenstern, Burkhard

doi:10.1093/bioinformatics/btu177

Cited by 127 publications

(140 citation statements)

References 44 publications

Supporting

Mentioning

137

Contrasting

Unclassified


“…However, as focussing on selected longer sequence motifs can still be beneficial for classification, we also recorded the frequencies of the 100 most abundant 8-mers in an independent set of bacterial genomes, scanning both strands and allowing for one mismatch. Spaced words were introduced for the alignment of dissimilar sequences [41,42]. Thus, their incorporation is useful in the context of novel species discovery.…”

Section: Methods (mentioning)

confidence: 99%

PaPrBaG: A machine learning approach for the detection of novel pathogens from NGS data

Deneke

Rentzsch

Renard

2017

Sci Rep


The reliable detection of novel bacterial pathogens from next-generation sequencing data is a key challenge for microbial diagnostics. Current computational tools usually rely on sequence similarity and often fail to detect novel species when closely related genomes are unavailable or missing from the reference database. Here we present the machine learning based approach PaPrBaG (Pathogenicity Prediction for Bacterial Genomes). PaPrBaG overcomes genetic divergence by training on a wide range of species with known pathogenicity phenotype. To that end we compiled a comprehensive list of pathogenic and non-pathogenic bacteria with human host, using various genome metadata in conjunction with a rule-based protocol. A detailed comparative study reveals that PaPrBaG has several advantages over sequence similarity approaches. Most importantly, it always provides a prediction whereas other approaches discard a large number of sequencing reads with low similarity to currently known reference genomes. Furthermore, PaPrBaG remains reliable even at very low genomic coverages. Combining PaPrBaG with existing approaches further improves prediction results.

The vast amount and diversity of bacteria on Earth, together with ever increasing human exposure [1], suggests that we will be continuously confronted with novel bacterial pathogens, too. Encouragingly, next-generation sequencing (NGS) has emerged as a novel, powerful diagnostic tool in this regard. However, the direct NGS-based characterisation of novel pathogenic strains or even species is still problematic when closely related genomes are unavailable or missing from the respective reference database. Here we introduce a machine learning based approach, PaPrBaG, which overcomes genetic divergence in predicting bacterial pathogenicity by training on a wide range of species with known pathogenicity phenotype. Importantly, even if this is avoided for practical reasons at some points throughout this (and related) work, one may more cautiously speak of pathogenic potential than pathogenicity, given that the latter is ultimately governed by the complex interplay between host (state) and pathogen.

Existing Methods

Existing approaches amenable to pathogenicity prediction broadly fall into two classes: protein content based and whole-genome based. Where assembled genomes are available, the presence/absence pattern of certain protein families can be expected to correlate with complex phenotypes, e.g. pathogenicity. This is primarily based on the presence of virulence factors (VFs), often acquired through horizontal gene transfer [2], or the absence of more common genes (functions) that become dispensable when e.g. host-specific pathogens evolve from commensal ancestors [3]. Three recent studies rely on these considerations.

The BacFier method by Iraola et al. [4] was the first to apply the described approach on a large scale. The authors defined eight VF categories and obtained 814 related VF protein families from KEGG [5]. They further used a set of 848 human-pathogenic (HP) and generally n...


“…Below is a spaced-word match between two DNA sequences S_1 and S_2 at (5, 2) with respect to the pattern P = 1100101:

\begin{matrix}
S_{1}: & G & C & T & G & T & A & T & A & C & G & T & C & \\
S_{2}: & & & & G & T & A & C & A & C & T & T & A & T \\
P: & & & & & 1 & 1 & 0 & 0 & 1 & 0 & 1 & &
\end{matrix}

By definition, nucleotides in S_1 and S_2 corresponding to a match position of P are identical, while at the don’t-care positions mismatches are possible. Throughout this paper, we use a single pattern P if two sequences are compared, as opposed to the multiple-pattern approach that we previously used (Leimeister et al., 2014). …”
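The definition quoted above can be checked mechanically. The following is a minimal sketch, not code from the cited paper: the function name `is_spaced_word_match` and the choice of 1-based positions (to mirror the "(5, 2)" notation) are assumptions made for illustration.

```python
def is_spaced_word_match(s1, s2, i, j, pattern):
    """Return True if s1 and s2 have a spaced-word match at 1-based (i, j).

    `pattern` is a string of '1' (match position) and '0' (don't-care).
    Hypothetical helper for illustration only.
    """
    i0, j0 = i - 1, j - 1  # convert to 0-based indices
    span = len(pattern)
    # The pattern window must fit inside both sequences.
    if i0 < 0 or j0 < 0 or i0 + span > len(s1) or j0 + span > len(s2):
        return False
    # Only match positions are compared; don't-care positions may differ.
    return all(s1[i0 + k] == s2[j0 + k]
               for k, p in enumerate(pattern) if p == "1")


S1 = "GCTGTATACGTC"
S2 = "GTACACTTAT"
P = "1100101"
print(is_spaced_word_match(S1, S2, 5, 2, P))  # True
```

At (5, 2) the two windows are TATACGT and TACACTT, which agree at the four match positions and differ at two of the three don't-care positions.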

Section: Algorithm (mentioning)

confidence: 99%

Fast and accurate phylogeny reconstruction using filtered spaced-word matches

2017

Self Cite


Motivation: Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods.

Results: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don’t-care positions, FSWM rapidly identifies spaced-word matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don’t-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don’t-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes.

Availability and Implementation: The program source code for FSWM, including documentation, as well as the software that we used to generate artificial genome sequences, are freely available at http://fswm.gobics.de/

Supplementary information: Supplementary data are available at Bioinformatics online.
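The pipeline described in this abstract (find spaced-word matches, filter out weak ones, count mismatches at the don't-care positions) can be sketched roughly as follows. This is an illustrative approximation, not the FSWM implementation: the function names, the use of plain window identity as the similarity score, the `min_identity` threshold, and the Jukes-Cantor correction are all assumptions introduced here.

```python
import math
from collections import defaultdict


def spaced_word(seq, start, match_idx):
    """Characters of seq at the match positions of a window starting at `start`."""
    return "".join(seq[start + k] for k in match_idx)


def estimate_distance(s1, s2, pattern, min_identity=0.5):
    """Rough substitutions-per-site estimate from filtered spaced-word matches.

    Assumptions (not from the source): similarity is the fraction of
    identical positions in the window, and the mismatch rate is corrected
    with the Jukes-Cantor formula.
    """
    match_idx = [k for k, p in enumerate(pattern) if p == "1"]
    care_idx = [k for k, p in enumerate(pattern) if p == "0"]
    span = len(pattern)

    # Index every spaced word of s1 by its start position.
    index = defaultdict(list)
    for i in range(len(s1) - span + 1):
        index[spaced_word(s1, i, match_idx)].append(i)

    mismatches = total = 0
    for j in range(len(s2) - span + 1):
        for i in index.get(spaced_word(s2, j, match_idx), []):
            diff = sum(s1[i + k] != s2[j + k] for k in care_idx)
            identity = 1 - diff / span  # match positions are identical
            if identity < min_identity:
                continue  # filter: discard likely spurious matches
            mismatches += diff
            total += len(care_idx)

    if total == 0:
        return None  # no usable spaced-word matches
    p = mismatches / total
    if p >= 0.75:
        return float("inf")  # Jukes-Cantor is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)


print(estimate_distance("GCTGTATACGTC", "GTACACTTAT", "1100101"))
```

On sequences this short the estimate is meaningless; the example only shows the data flow. A realistic pattern would have many more don't-care positions, so that each match contributes enough sites for a stable mismatch rate.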


“…a predefined binary pattern of "match positions" and "don't care positions". spaced [84][85][86] is similar to previous methods that compare the k-mer composition of DNA or protein sequences. However, the program uses so-called "spaced words" instead of k-mers.…”
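To make the contrast with contiguous k-mers concrete, here is a small sketch of counting spaced words and comparing two sequences by their spaced-word composition. The function names and the use of Euclidean distance on relative frequencies are assumptions chosen for illustration; the statement above does not specify which distance the program uses.

```python
import math
from collections import Counter


def spaced_word_counts(seq, pattern):
    """Count spaced words of `seq` under a '1'/'0' pattern.

    With an all-ones pattern this reduces to ordinary k-mer counting.
    """
    match_idx = [k for k, p in enumerate(pattern) if p == "1"]
    return Counter(
        "".join(seq[i + k] for k in match_idx)
        for i in range(len(seq) - len(pattern) + 1)
    )


def euclidean_distance(c1, c2):
    """Euclidean distance between relative spaced-word frequencies.

    Both sequences must be at least as long as the pattern, otherwise
    the counts are empty and the frequencies are undefined.
    """
    n1, n2 = sum(c1.values()), sum(c2.values())
    if n1 == 0 or n2 == 0:
        raise ValueError("sequence shorter than pattern")
    words = set(c1) | set(c2)
    return math.sqrt(sum((c1[w] / n1 - c2[w] / n2) ** 2 for w in words))


a = spaced_word_counts("GCTGTATACGTC", "1100101")
b = spaced_word_counts("GTACACTTAT", "1100101")
print(euclidean_distance(a, b))
```

Because only the match positions contribute to a spaced word, two windows that differ at a don't-care position still increment the same counter, which is what makes the profile more tolerant of substitutions than a contiguous k-mer profile of the same span.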

Section: Multi-spam (mentioning)

confidence: 99%

Benchmarking of alignment-free sequence comparison methods

Zieleziński

Girgis

Bernard

et al. 2019

Preprint

Self Cite


Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference and reconstruction of species trees under horizontal gene transfer and recombination events. The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.

BACKGROUND

Comparative analysis of DNA and amino acid sequences is of fundamental importance in biological research, particularly in molecular biology and genomics. It is the first and key step in molecular evolutionary analysis, gene function and regulatory region prediction, sequence assembly, homology searching, molecular structure prediction, gene discovery and protein structure-function relationships analysis. Traditionally, sequence comparison was based on pairwise or multiple sequence alignment (MSA). Software tools for sequence alignment, such as BLAST [1] and CLUSTAL [2], are the most widely used bioinformatics methods.

Although alignment-based approaches generally remain the references for sequence
