2019
DOI: 10.1038/s41598-019-52196-4
Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Models

Abstract: The performance of most error-correction (EC) algorithms that operate on genomics reads depends on the proper choice of their configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction and consequently improve genome assembly. We perform this in an adaptive manner, adapted to different datasets and to EC tools, due to the observation that different configuration para…

Cited by 10 publications (13 citation statements)
References 43 publications
“…The random forests hereby replace our hand-crafted conditions to decide whether a specific position in a read should be modified. This is in contrast to previous recent machine learning approaches like Athena [ 18 ] and Lerna [ 19 ] which try to find optimal input parameters for existing correction algorithms. Third, the algorithm has been optimized to reduce both runtime and memory consumption on both CPUs and GPUs.…”
Section: Introduction
confidence: 77%
“…Benchmark Data: We release our database corpus (4 datasets) and codes for the community to access it for anomaly detection and defect type classification and to build on it with new datasets and models. 2 We are unveiling real failures of a pharmaceutical packaging manufacturer company.…”
Section: RPM Selection and Aggregation
confidence: 99%
“…• ML-based forecasting models: We selected six popular time series forecasting models, including Recurrent Neural Network (RNN) [45], LSTM [18] (an improved version of RNN that has been used in different applications [2,19]), Deep Neural Network (DNN) [40], AutoEncoder [12], and the recent works DeepAR [38] and DeepFactors [52].…”
Section: Temporal Anomaly Detection
confidence: 99%
“…Popular EC tools such as Lighter [18] and LoRDEC [11] for short-(<400 base pairs) and long-read sequences (>400 base pairs) respectively, require the user to select the k-value. Determining a favorable k-value among possible ones has been explicitly pointed out as an open area of work [19,20,4], since an arbitrary k-value could generate sub-optimal assemblies. In these scenarios, the best k-values need to be found by exploring all the possible k-values [20].…”
Section: k-mer Based EC Tools
confidence: 99%
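The exhaustive k-value exploration described in the quoted passage can be sketched as a simple search loop. This is a hypothetical illustration: `run_ec_tool` and `score_corrected` are placeholder callables standing in for an EC tool run (e.g. Lighter or LoRDEC) and a downstream quality metric such as alignment rate; they are not real APIs of those tools.

```python
def best_k(reads, candidate_ks, run_ec_tool, score_corrected):
    """Try every candidate k, correct the reads with it, and return
    the k whose corrected output scores best, plus all scores."""
    scores = {}
    for k in candidate_ks:
        corrected = run_ec_tool(reads, k)   # placeholder EC invocation
        scores[k] = score_corrected(corrected)
    best = max(scores, key=scores.get)
    return best, scores

# Toy usage: a stand-in EC run that just returns k, and a fake quality
# score that peaks at k = 21.
if __name__ == "__main__":
    k, scores = best_k(
        "reads.fq",
        range(15, 32, 2),
        lambda reads, k: k,
        lambda corrected_k: -abs(corrected_k - 21),
    )
    print(k)  # 21
```

The dictionary of per-k scores is returned alongside the winner so a caller can inspect how sharply quality varies with k, which is the behavior the quoted work exploits.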
“…In Table 5, |w| refers to the word length selected for training and |V| is the vocabulary size of the data. The mean and standard deviation of the perplexity score have been calculated on the corrected data for k = 15, 17, 19, 21, 23, 25, 27, 31, 37, 45, since beyond this value there was no change in the percentage of alignment of the corrected reads with the reference genome. Furthermore, a lower k-value is not usually recommended; in most cases, it is not supported by the error correction tool.…”
Section: Evaluation on Long Reads
confidence: 99%
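The perplexity statistics referred to above can be computed, under the standard definition of perplexity as the exponential of the mean negative log-probability assigned by a language model, with a short helper. This is a sketch of the metric itself, not the cited paper's implementation:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability) over the
    per-token probabilities a language model assigns to a sequence."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def mean_std(values):
    """Population mean and standard deviation, e.g. of the perplexity
    scores obtained across the different k-values."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var)

# Sanity check: a uniform model over 4 outcomes has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```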