K-mer-based machine learning method to classify LTR-retrotransposons in plant genomes

Orozco-Arias, Simón; Candamil-Cortés, Mariana S.; Jaimes, Paula A.; Piña, Johan S.; Tabares-Soto, Reinel; Guyot, Romain; Isaza, Gustavo

doi:10.7717/peerj.11456

Cited by 14 publications

(4 citation statements)

References 66 publications

“…Basically, a bunch of LTR-RT taken from InpactorDB [15] was randomly placed inside an entire DNA sequence with a fixed length of 50,000 bp. The nucleotides filling the space between one LTR-RT and another corresponded to sequences that are known to not contain LTR-RT (negative data set taken from [45] DOI: 10.5281/zenodo.4543904, See Methodology section). After the synthetic creation of DNA sequences, they were transformed into a one-hot 2D representation and they were used as features for training the CNN.…”

Section: Results (mentioning)

confidence: 99%
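
The one-hot 2D representation mentioned in the statement above can be illustrated with a minimal sketch. The row order (A, C, G, T), the all-zero column for ambiguous bases, and the function name `one_hot` are assumptions made for illustration; the encoding used in the cited paper may differ.

```python
# Hypothetical helper: encode a DNA string as a 4 x L matrix of 0/1 values.
# Row order A, C, G, T is an assumption, not taken from the cited paper.
ALPHABET = "ACGT"

def one_hot(seq):
    """Return 4 rows (A, C, G, T); column j holds a 1 in the row matching seq[j]."""
    seq = seq.upper()
    # Bases outside ACGT (for example N) yield an all-zero column (assumption).
    return [[1 if base == letter else 0 for base in seq] for letter in ALPHABET]

matrix = one_hot("ACGTN")
for letter, row in zip(ALPHABET, matrix):
    print(letter, row)
```

Each column of `matrix` describes one nucleotide, so a 50,000 bp sequence would give a 4 x 50,000 input under this layout.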

Genomic object detection: An improved approach for transposable elements detection and classification using convolutional neural networks

Orozco-Arias, Lopez-Murillo, Piña et al. 2023

PLoS ONE

Self Cite

Analysis of eukaryotic genomes requires the detection and classification of transposable elements (TEs), a crucial but complex and time-consuming task. To improve the performance of tools that accomplish these tasks, Machine Learning approaches (ML) that leverage computer resources, such as GPUs (Graphical Processing Unit) and multiple CPU (Central Processing Unit) cores, have been adopted. However, until now, the use of ML techniques has mostly been limited to classification of TEs. Herein, a detection-classification strategy (named YORO) based on convolutional neural networks is adapted from computer vision (YOLO) to genomics. This approach enables the detection of genomic objects through the prediction of the position, length, and classification in large DNA sequences such as fully sequenced genomes. As a proof of concept, the internal protein-coding domains of LTR-retrotransposons are used to train the proposed neural network. Precision, recall, accuracy, F1-score, execution times and time ratios, as well as several graphical representations were used as metrics to measure performance. These promising results open the door for a new generation of Deep Learning tools for genomics. YORO architecture is available at https://github.com/simonorozcoarias/YORO.

Section: Results (mentioning)

confidence: 99%

“…Create a synthetic DNA sequence of 50,000 bp by concatenating sequences known to not include any LTR-RT (i.e coding sequences, different types of RNA like mRNA, tRNA, non-coding RNA, and other types of TEs such as TEs Class II) from [45] DOI: 10.5281/zenodo.4543904. These sequences are called “negative background”.…”

Section: Methods (mentioning)

confidence: 99%
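
The concatenation step quoted above might look roughly like the sketch below. Only the 50,000 bp target length comes from the statement; the random sampling with replacement, the trimming of the final piece, the function name `build_background`, and the toy sequences are all assumptions for illustration.

```python
import random

def build_background(negatives, length=50_000, seed=0):
    """Concatenate randomly chosen negative sequences and trim to `length` bp."""
    pool = [s for s in negatives if s]  # ignore empty entries
    if not pool:
        raise ValueError("at least one non-empty negative sequence is required")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    parts, total = [], 0
    while total < length:
        piece = rng.choice(pool)
        parts.append(piece)
        total += len(piece)
    # Trimming the overshoot is an assumption; the source does not say how
    # the exact length is reached.
    return "".join(parts)[:length]

# Toy stand-ins for real LTR-RT-free sequences (hypothetical data).
toy_negatives = ["ACGTACGTAC", "GGGCCCTTTAAA", "TTGACA"]
background = build_background(toy_negatives)
print(len(background))
```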

Genomic object detection: An improved approach for transposable elements detection and classification using convolutional neural networks

Orozco-Arias, Lopez-Murillo, Piña et al. 2023

PLoS ONE

Self Cite

“…Due to the categorical nature of genomic data, this activity is crucial to be able to use ML models [36]. K-mers frequencies were used as features using 1 ≤ k ≤ 6 due to this approach seems to be useful for machine learning algorithms [37]. To this converted data set, scaling and dimension reduction techniques were applied using principal component analysis (PCA) with an explained variance of 96% (reduction of the initial number of features from 5460 to 2254).…”

Section: Methods (mentioning)

confidence: 99%
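
The figure of 5460 initial features in the statement above matches the number of possible DNA k-mers for 1 ≤ k ≤ 6, since 4 + 16 + 64 + 256 + 1024 + 4096 = 5460. The sketch below shows one way such a feature vector could be built. It returns raw overlapping counts rather than normalised frequencies, and the function names are hypothetical; the scaling and PCA steps are not shown.

```python
from itertools import product

def kmer_vocabulary(k_min=1, k_max=6, alphabet="ACGT"):
    """List every k-mer over `alphabet` for k_min <= k <= k_max."""
    return ["".join(p)
            for k in range(k_min, k_max + 1)
            for p in product(alphabet, repeat=k)]

def kmer_counts(seq, k_min=1, k_max=6):
    """Count overlapping k-mers in `seq`; k-mers with non-ACGT bases are skipped."""
    seq = seq.upper()
    counts = dict.fromkeys(kmer_vocabulary(k_min, k_max), 0)
    for k in range(k_min, k_max + 1):
        for i in range(max(len(seq) - k + 1, 0)):
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    return counts

features = kmer_counts("ACGTACGTTTGACA")
print(len(features))  # one entry per possible k-mer, 1 <= k <= 6
```

Dividing each count by the number of windows of that length would turn the counts into frequencies, if that normalisation is wanted.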

Automatic curation of LTR retrotransposon libraries from plant genomes through machine learning

Orozco-Arias, Candamil-Cortés, Jaimes et al. 2022

Journal of Integrative Bioinformatics

Transposable elements are mobile sequences that can move and insert themselves into chromosomes, activating under internal or external stimuli, giving the organism the ability to adapt to the environment. Annotating transposable elements in genomic data is currently considered a crucial task to understand key aspects of organisms such as phenotype variability, species evolution, and genome size, among others. Because of the way they replicate, LTR retrotransposons are the most common transposable elements in plants, accounting in some cases for up to 80% of all DNA information. To annotate these elements, a reference library is usually created, a curation process is performed, eliminating TE fragments and false positives and then annotated in the genome using the homology method. However, the curation process can take weeks, requires extensive manual work and the execution of multiple time-consuming bioinformatics software. Here, we propose a machine learning-based approach to perform this process automatically on plant genomes, obtaining up to 91.18% F1-score. This approach was tested with four plant species, obtaining up to 93.6% F1-score (Oryza granulata) in only 22.61 s, where bioinformatics methods took approximately 6 h. This acceleration demonstrates that the ML-based approach is efficient and could be used in massive sequencing projects.

“…In this work, we have focused on the development of a general and accurate method based on natural language text processing (NLP) and machine learning models to predict whether a protein sequence will exhibit an antifreeze property or not. We have used K-mer counting to extract different K-mer features from the protein sequences which has earlier been adopted by various studies to tackle many bioinformatics problems. To the best of our knowledge, for the first time, NLP has been proposed to classify AFPs. We also employed the state-of-the-art explainability model, Shapley Additive eXplanations (SHAP), to gain insights into the outcomes produced by the machine learning models.…”

mentioning

confidence: 99%

Accurate Prediction of Antifreeze Protein from Sequences through Natural Language Text Processing and Interpretable Machine Learning Approaches

Dhibar, Jana 2023

J. Phys. Chem. Lett.
