Lerna: transformer architectures for configuring error correction tools for short- and long-read genome sequencing

Sharma, Atul; Jain, Priyanka; Mahgoub, Ashraf; Zhou, Zihan; Mahadik, Kanak; Chaterji, Somali

doi:10.1186/s12859-021-04547-0

Cited by 6 publications

(7 citation statements)

References 65 publications

(87 reference statements)

“…The random forests hereby replace our hand-crafted conditions to decide whether a specific position in a read should be modified. This is in contrast to previous recent machine learning approaches like Athena [18] and Lerna [19] which try to find optimal input parameters for existing correction algorithms. Third, the algorithm has been optimized to reduce both runtime and memory consumption on both CPUs and GPUs.…”

Section: Introduction (mentioning)

confidence: 76%

CARE 2.0: reducing false-positive sequencing error corrections using machine learning

2022

Background: Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools.

Results: We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations its most significant change is the replacement of CARE 1.0’s hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders-of-magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 is able to achieve high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads CARE 2.0 generates only 1.2M false positives (FPs) (and 801.4M true positives (TPs)) at a highly competitive runtime while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data.

Conclusion: False-positive corrections can negatively influence down-stream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. Thus, higher-quality datasets are produced which improve k-mer analysis and de-novo assembly in real-world datasets which demonstrates the applicability of machine learning techniques in the context of sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE.
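The FP and TP counts quoted in this abstract can be turned into precision values with a little arithmetic. The sketch below is only an illustration of that calculation, not part of CARE 2.0; the helper name `precision` is hypothetical, and the figure for the other tools is an upper bound derived from the abstract's "at least 3.9M FPs and at most 814.5M TPs".

```python
def precision(true_positives: float, false_positives: float) -> float:
    """Fraction of applied corrections that were correct: TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)


# Counts quoted in the abstract (simulated human dataset, 914M reads).
care_precision = precision(801.4e6, 1.2e6)

# Precision rises with TP and falls with FP, so pairing the largest quoted TP
# count with the smallest quoted FP count bounds every other tool from above.
# This is an assumption-based bound, not a figure reported by the authors.
others_upper_bound = precision(814.5e6, 3.9e6)

print(f"CARE 2.0 precision:       {care_precision:.4f}")
print(f"Other tools, upper bound: {others_upper_bound:.4f}")
```

Under these numbers the share of incorrect corrections is roughly 0.15% for CARE 2.0 and at least about 0.48% for the other tools.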

“…Compared to k-mer-based approaches the performance is drastically lower. Naturally, with longer substrings, the vocabulary size increases allowing for more detailed pattern recognition (28).…”

Section: Results (mentioning)

confidence: 99%
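The remark about vocabulary size growing with substring length can be made concrete for k-mer tokens. The following is a minimal sketch assuming a four-letter DNA alphabet (A, C, G, T) and overlapping k-mers with stride 1; the function names are hypothetical and are not taken from the cited paper.

```python
def kmers(read: str, k: int) -> list[str]:
    """Split a read into its overlapping k-mers (stride 1)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def vocabulary_size(k: int, alphabet_size: int = 4) -> int:
    """Number of distinct k-mers possible over the alphabet: alphabet_size ** k."""
    return alphabet_size ** k


tokens = kmers("ACGTA", 3)
sizes = {k: vocabulary_size(k) for k in (3, 6, 12)}
print(tokens)
print(sizes)
```

Each extra base multiplies the number of possible tokens by four, which is why longer substrings give a model finer-grained patterns at the cost of a much larger embedding table.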

“…Taxonomic classification tools are often based on alignment-free mapping approaches enabling the analysis of millions of reads with a high accuracy. Deep learning has shown its potential in a variety of problems on sequence-based data (12, 13, 28–30). Therefore, new classification approaches were explored using deep learning with good results on genus and species prediction even on non-curated databases (16).…”

Section: Discussion (mentioning)

confidence: 99%

MetaTransformer: deep metagenomic sequencing read classification using self-attention models

Wichmann, Buschong, Müller et al., 2023

NAR Genomics and Bioinformatics

Deep learning has emerged as a paradigm that revolutionizes numerous domains of scientific research. Transformers have been utilized in language modeling outperforming previous approaches. Therefore, the utilization of deep learning as a tool for analyzing the genomic sequences is promising, yielding convincing results in fields such as motif identification and variant calling. DeepMicrobes, a machine learning-based classifier, has recently been introduced for taxonomic prediction at species and genus level. However, it relies on complex models based on bidirectional long short-term memory cells resulting in slow runtimes and excessive memory requirements, hampering its effective usability. We present MetaTransformer, a self-attention-based deep learning metagenomic analysis tool. Our transformer-encoder-based models enable efficient parallelization while outperforming DeepMicrobes in terms of species and genus classification abilities. Furthermore, we investigate approaches to reduce memory consumption and boost performance using different embedding schemes. As a result, we are able to achieve 2× to 5× speedup for inference compared to DeepMicrobes while keeping a significantly smaller memory footprint. MetaTransformer can be trained in 9 hours for genus and 16 hours for species prediction. Our results demonstrate performance improvements due to self-attention models and the impact of embedding schemes in deep learning on metagenomic sequencing data.

“…This information and factors such as quality scores and genomic coverage contribute to the formulation of features used in training machine learning models. Furthermore, other machine learning methods specialize in identifying the optimal k-mer size essential for independent error correction tools [31, 32] (see Table 1).…”

Section: Introduction (mentioning)

confidence: 99%

MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads

Sami, El-Metwally, Rashad, 2024

BMC Bioinformatics

Background: The rapid advancement of next-generation sequencing (NGS) machines in terms of speed and affordability has led to the generation of a massive amount of biological data at the expense of data quality as errors become more prevalent. This introduces the need to utilize different approaches to detect and filtrate errors, and data quality assurance is moved from the hardware space to the software preprocessing stages.

Results: We introduce MAC-ErrorReads, a novel Machine learning-Assisted Classifier designed for filtering Erroneous NGS Reads. MAC-ErrorReads transforms the erroneous NGS read filtration process into a robust binary classification task, employing five supervised machine learning algorithms. These models are trained on features extracted through the computation of Term Frequency-Inverse Document Frequency (TF_IDF) values from various datasets such as E. coli, GAGE S. aureus, H. Chr14, Arabidopsis thaliana Chr1 and Metriaclima zebra. Notably, Naive Bayes demonstrated robust performance across various datasets, displaying high accuracy, precision, recall, F1-score, MCC, and ROC values. The MAC-ErrorReads NB model accurately classified S. aureus reads, surpassing most error correction tools with a 38.69% alignment rate. For H. Chr14, tools like Lighter, Karect, CARE, Pollux, and MAC-ErrorReads showed rates above 99%. BFC and RECKONER exceeded 98%, while Fiona had 95.78%. For the Arabidopsis thaliana Chr1, Pollux, Karect, RECKONER, and MAC-ErrorReads demonstrated good alignment rates of 92.62%, 91.80%, 91.78%, and 90.87%, respectively. For the Metriaclima zebra, Pollux achieved a high alignment rate of 91.23%, despite having the lowest number of mapped reads. MAC-ErrorReads, Karect, and RECKONER demonstrated good alignment rates of 83.76%, 83.71%, and 83.67%, respectively, while also producing reasonable numbers of mapped reads to the reference genome.

Conclusions: This study demonstrates that machine learning approaches for filtering NGS reads effectively identify and retain the most accurate reads, significantly enhancing assembly quality and genomic coverage. The integration of genomics and artificial intelligence through machine learning algorithms holds promise for enhancing NGS data quality, advancing downstream data analysis accuracy, and opening new opportunities in genetics, genomics, and personalized medicine research.
