AKE - the Accelerated k-mer Exploration web-tool for rapid taxonomic classification and visualization

Langenkämper, Daniel; Goesmann, Alexander; Nattkemper, Tim Wilhelm

doi:10.1186/s12859-014-0384-0

Cited by 8 publications

(4 citation statements)

References 29 publications


“…Because of its ubiquitous usage for the field, various approaches for each of the domains exist and many of the applications incorporate parallelism. Furthermore, some applications exploiting algorithmic properties for fast computation exist which use e.g., k-mers instead of comparing string sequences, with some also using parallelization techniques (McHardy et al, 2007; Langenkämper et al, 2014). In database searches for instance, fast alignment algorithms are required that are capable to handle the increase in database sizes and the increase of queries to these databases.…”

Section: Methods (mentioning)

confidence: 99%

Comparison of Acceleration Techniques for Selected Low-Level Bioinformatics Operations

et al. 2016

Self Cite


In recent years, clock rates of modern processors have stagnated while the demand for computing power has continued to grow. This applies particularly to the life sciences and bioinformatics, where new technologies keep creating rapidly growing piles of raw data at increasing speed. The number of cores per processor has increased in an attempt to compensate for the only slight increments in clock rates. This technological shift demands changes in software development, especially in high performance computing, where parallelization techniques are gaining in importance due to the pressing issue of large datasets generated by, e.g., modern genomics. This paper presents an overview of state-of-the-art manual and automatic acceleration techniques and lists some applications employing these in different areas of sequence informatics. Furthermore, we provide examples of automatic acceleration for two use cases to show typical problems and gains of transforming a serial application into a parallel one. The paper should aid the reader in choosing a technique for the problem at hand. We compare four different state-of-the-art automatic acceleration approaches (OpenMP, PluTo-SICA, PPCG, and OpenACC). Their performance as well as their applicability for selected use cases is discussed. While optimizations targeting the CPU worked better in the complex k-mer use case, optimizers for Graphics Processing Units (GPUs) performed better in the matrix multiplication example. However, GPU performance is superior only beyond a certain problem size due to data migration overhead. We show that automatic code parallelization is feasible with current compiler software and yields significant increases in execution speed. Automatic optimizers for CPU are mature and usually no additional manual adjustment is required. In contrast, some automatic parallelizers targeting GPUs still lack maturity and are limited to simple statements and structures.



“…Comparing these frequencies is computationally easier than sequence alignment, and is an important method in alignment-free sequence analysis. The k-mer-based method is implemented in tools such as CLARK [71] or Kraken [31], GC-content is used in TAC-ELM [72], and oligonucleotide frequencies are used in TACOA [73], MetaID [74], or AKE [75]. When building classifiers, these features could eventually be extended by estimated open reading frame (ORF) length or/and density, codon usage, motifs, or repeats, such as microsatellites, transposons, or CRISPRs (clustered regularly interspaced short palindromic repeats) that could help to differentiate viral from non-viral sequences.…”

Section: Data Analysis Pipeline Design (mentioning)

confidence: 99%

Considerations for Optimization of High-Throughput Sequencing Bioinformatics Pipelines for Virus Detection

Lambert, Braxton, Charlebois, et al. 2018. Viruses


High-throughput sequencing (HTS) has demonstrated capabilities for broad virus detection based upon discovery of known and novel viruses in a variety of samples, including clinical, environmental, and biological. An important goal for HTS applications in biologics is to establish parameter settings that can afford adequate sensitivity at an acceptable computational cost (computation time, computer memory, storage, expense or/and efficiency), at critical steps in the bioinformatics pipeline, including initial data quality assessment, trimming/cleaning, and assembly (to reduce data volume and increase likelihood of appropriate sequence identification). Additionally, the quality and reliability of the results depend on the availability of a complete and curated viral database for obtaining accurate results; selection of sequence alignment programs and their configuration, that retains specificity for broad virus detection with reduced false-positive signals; removal of host sequences without loss of endogenous viral sequences of interest; and use of a meaningful reporting format, which can retain critical information of the analysis for presentation of readily interpretable data and actionable results. Furthermore, after alignment, both automated and manual evaluation may be needed to verify the results and help assign a potential risk level to residual, unmapped reads. We hope that the collective considerations discussed in this paper aid toward optimization of data analysis pipelines for virus detection by HTS.


“…These features can be used as inputs for machine learning models trained to predict classifications such as the taxonomic designation associated with sequences (Solis-Reyes et al 2018). Machine learning models that operate on k-mer input features have previously been applied in DNA barcode sequence classification and other predictive tasks (Kuksa and Pavlovic 2009; Langenkämper et al 2014; Ainsworth et al 2017; Cordier et al 2017). The application of these tools is often limited to specific taxonomic classification tasks (Kuksa and Pavlovic 2009), or they rely on user-provided sets of sequence data for model training (Langenkämper et al 2014).…”

Section: Introduction (mentioning)

confidence: 99%

Alignment-free classification of COI DNA barcode data with the Python package Alfie

Nugent, Adamowicz 2020. MBMG


Characterization of biodiversity from environmental DNA samples and bulk metabarcoding data is hampered by off-target sequences that can confound conclusions about a taxonomic group of interest. Existing methods for isolation of target sequences rely on alignment to existing reference barcodes, but this can bias results against novel genetic variants. Effectively parsing targeted DNA barcode data from off-target noise improves the quality of biodiversity estimates and biological conclusions by limiting subsequent analyses to a relevant subset of available data. Here, we present Alfie, a Python package for the alignment-free classification of cytochrome c oxidase subunit I (COI) DNA barcode sequences to taxonomic kingdoms. The package determines k-mer frequencies of DNA sequences, and the frequencies serve as input for a neural network classifier that was trained and tested using ~58,000 publicly available COI sequences. The classifier was designed and optimized through a series of tests that allowed for the optimal set of DNA k-mer features and optimal machine learning algorithm to be selected. The neural network classifier rapidly assigns COI sequences of varying lengths to kingdoms with greater than 99% accuracy and is shown to generalize effectively and make accurate predictions about data from previously unseen taxonomic classes. The package contains an application programming interface that allows the Alfie package’s functionality to be extended to different DNA sequence classification tasks to suit a user’s need, including classification of different genes and barcodes, and classification to different taxonomic levels. Alfie is free and publicly available through GitHub (https://github.com/CNuge/alfie) and the Python package index (https://pypi.org/project/alfie/).
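The abstract above describes k-mer frequencies serving as input features for a classifier. As a rough illustration of that idea only, the following Python sketch turns a DNA sequence into a fixed-length vector of relative k-mer frequencies. It does not use Alfie's or AKE's actual API; the function name `kmer_frequencies`, the default `k=4`, and the choice to ignore windows containing ambiguous bases are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return relative k-mer frequencies in a fixed, alphabetical k-mer order.

    Hypothetical helper for illustration; not part of any cited package.
    """
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all 4**k k-mers over A, C, G, T so the vector length is fixed.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Assumption: windows with ambiguous bases (e.g. N) are left out of the total.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    # Dividing by the total makes sequences of different lengths comparable.
    return [counts[km] / total for km in kmers]


features = kmer_frequencies("ACGTACGTAC", k=2)
print(len(features))  # 16 entries, one per possible 2-mer
```

Because the vector always has 4**k entries regardless of sequence length, it can be passed to any classifier that expects fixed-size numeric input.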

