2019
DOI: 10.1038/s41598-019-52196-4
Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Models

Abstract: The performance of most error-correction (EC) algorithms that operate on genomics reads depends on the proper choice of their configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction and consequently improve genome assembly. We perform this in an adaptive manner, adapted to different datasets and to EC tools, due to the observation that different configuration para…

Cited by 10 publications (13 citation statements)
References 43 publications
“…The random forests hereby replace our hand-crafted conditions to decide whether a specific position in a read should be modified. This is in contrast to previous recent machine learning approaches like Athena [ 18 ] and Lerna [ 19 ] which try to find optimal input parameters for existing correction algorithms. Third, the algorithm has been optimized to reduce both runtime and memory consumption on both CPUs and GPUs.…”
Section: Introduction
confidence: 77%
“…Benchmark Data: We release our database corpus (4 datasets) and codes for the community to access it for anomaly detection and defect type classification and to build on it with new datasets and models. 2 We are unveiling real failures of a pharmaceutical packaging manufacturer company.…”
Section: RPM Selection and Aggregation
confidence: 99%
“…• ML-based forecasting models: We selected six popular time series forecasting models, including Recurrent Neural Network (RNN) [45], LSTM [18] (an improved version of RNN that has been used in different applications [2,19]), Deep Neural Network (DNN) [40], AutoEncoder [12], and the recent works DeepAR [38] and DeepFactors [52].…”
Section: Temporal Anomaly Detection
confidence: 99%
“…Popular EC tools such as Lighter [18] and LoRDEC [11] for short-(<400 base pairs) and long-read sequences (>400 base pairs) respectively, require the user to select the k-value. Determining a favorable k-value among possible ones has been explicitly pointed out as an open area of work [19,20,4], since an arbitrary k-value could generate sub-optimal assemblies. In these scenarios, the best k-values need to be found by exploring all the possible k-values [20].…”
Section: k-mer Based EC Tools
confidence: 99%
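The exhaustive k-value exploration described in the quoted passage can be sketched as a simple search loop. This is a hypothetical illustration: `run_ec_tool` and `score_corrected` are placeholder callables standing in for an EC tool run (e.g. Lighter or LoRDEC) and a downstream quality metric such as alignment rate; they are not real APIs of those tools.

```python
def best_k(reads, candidate_ks, run_ec_tool, score_corrected):
    """Try every candidate k, correct the reads with it, and return
    the k whose corrected output scores best, plus all scores."""
    scores = {}
    for k in candidate_ks:
        corrected = run_ec_tool(reads, k)   # placeholder EC invocation
        scores[k] = score_corrected(corrected)
    best = max(scores, key=scores.get)
    return best, scores

# Toy usage: a stand-in EC run that just returns k, and a fake quality
# score that peaks at k = 21.
if __name__ == "__main__":
    k, scores = best_k(
        "reads.fq",
        range(15, 32, 2),
        lambda reads, k: k,
        lambda corrected_k: -abs(corrected_k - 21),
    )
    print(k)  # 21
```

The dictionary of per-k scores is returned alongside the winner so a caller can inspect how sharply quality varies with k, which is the behavior the quoted work exploits.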
“…In Table 5, |w| refers to the word length selected for training and |V| is the vocabulary size of the data. The mean and standard deviation of the perplexity score have been calculated on the corrected data for k = 15, 17, 19, 21, 23, 25, 27, 31, 37, 45, since beyond this value there was no change in the percentage of alignment of the corrected reads with the reference genome. Furthermore, a lower k-value is not usually recommended; in most cases, it is not supported by the error correction tool.…”
Section: Evaluation on Long Reads
confidence: 99%
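The perplexity statistics referred to above can be computed, under the standard definition of perplexity as the exponential of the mean negative log-probability assigned by a language model, with a short helper. This is a sketch of the metric itself, not the cited paper's implementation:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability) over the
    per-token probabilities a language model assigns to a sequence."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def mean_std(values):
    """Population mean and standard deviation, e.g. of the perplexity
    scores obtained across the different k-values."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var)

# Sanity check: a uniform model over 4 outcomes has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```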