2021
DOI: 10.7717/peerj.11456

K-mer-based machine learning method to classify LTR-retrotransposons in plant genomes

Abstract: Every day more plant genomes are available in public databases and additional massive sequencing projects (i.e., that aim to sequence thousands of individuals) are formulated and released. Nevertheless, there are not enough automatic tools to analyze this large amount of genomic information. LTR retrotransposons are the most frequent repetitive sequences in plant genomes; however, their detection and classification are commonly performed using semi-automatic and time-consuming programs. Despite the availabilit…

Cited by 14 publications
(4 citation statements)
References 66 publications
“…Basically, a bunch of LTR-RT taken from InpactorDB [15] was randomly placed inside an entire DNA sequence with a fixed length of 50,000 bp. The nucleotides filling the space between one LTR-RT and another corresponded to sequences that are known to not contain LTR-RT (negative data set taken from [45] DOI: 10.5281/zenodo.4543904, See Methodology section). After the synthetic creation of DNA sequences, they were transformed into a one-hot 2D representation and they were used as features for training the CNN.…”
Section: Results
mentioning
confidence: 99%
“…Create a synthetic DNA sequence of 50,000 bp by concatenating sequences known to not include any LTR-RT (i.e coding sequences, different types of RNA like mRNA, tRNA, non-coding RNA, and other types of TEs such as TEs Class II) from [45] DOI: 10.5281/zenodo.4543904. These sequences are called “negative background”.…”
Section: Methods
mentioning
confidence: 99%
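
The two statements above describe building a fixed-length synthetic sequence from a "negative background", placing LTR-RT elements inside it, and one-hot encoding the result. The following is a minimal sketch of that idea, not the citing authors' implementation. The function names, the choice to overwrite background windows (rather than splice elements in), the bounded retry loop, and the 4 x L row order A, C, G, T are all assumptions made for illustration.

```python
import random

ALPHABET = "ACGT"
TARGET_LEN = 50_000  # fixed sequence length quoted in the citation statements


def build_background(negatives, length, rng):
    """Concatenate randomly chosen negative sequences, then trim to `length`."""
    parts, total = [], 0
    while total < length:
        seq = rng.choice(negatives)
        parts.append(seq)
        total += len(seq)
    return "".join(parts)[:length]


def place_elements(background, elements, rng, max_tries=100):
    """Overwrite random, non-overlapping windows of the background with elements.

    Assumption: overwriting keeps the total length fixed. Elements that do not
    fit after `max_tries` attempts are skipped. Returns the new sequence and
    the sorted (start, end) intervals that were written.
    """
    seq = list(background)
    placed = []
    for element in elements:
        if len(element) > len(seq):
            continue
        for _ in range(max_tries):
            start = rng.randrange(0, len(seq) - len(element) + 1)
            end = start + len(element)
            if all(end <= s or start >= e for s, e in placed):
                seq[start:end] = element
                placed.append((start, end))
                break
    return "".join(seq), sorted(placed)


def one_hot(seq):
    """Return a 4 x len(seq) matrix (rows A, C, G, T).

    Bases outside ACGT (for example N) are left as all-zero columns.
    """
    index = {base: row for row, base in enumerate(ALPHABET)}
    matrix = [[0] * len(seq) for _ in ALPHABET]
    for pos, base in enumerate(seq.upper()):
        row = index.get(base)
        if row is not None:
            matrix[row][pos] = 1
    return matrix
```

With real data, `negatives` would hold the negative-background sequences and `elements` the LTR-RT sequences; calling `build_background(negatives, TARGET_LEN, random.Random(0))`, then `place_elements`, then `one_hot` yields a 4 x 50,000 matrix, and the returned intervals can serve as position labels.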
“…Due to the categorical nature of genomic data, this activity is crucial to be able to use ML models [36]. K-mers frequencies were used as features using 1 ≤ k ≤ 6 due to this approach seems to be useful for machine learning algorithms [37]. To this converted data set, scaling and dimension reduction techniques were applied using principal component analysis (PCA) with an explained variance of 96% (reduction of the initial number of features from 5460 to 2254).…”
Section: Methods
mentioning
confidence: 99%
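
The figure of 5460 initial features in the statement above matches the number of possible DNA k-mers for 1 ≤ k ≤ 6, since 4 + 16 + 64 + 256 + 1024 + 4096 = 5460. The sketch below shows one way to compute such a feature vector. It is an illustration only: the function names, the normalization (each count divided by the number of windows of that length), and the handling of non-ACGT bases are assumptions, and the scaling and PCA steps mentioned in the statement are not shown.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_vocabulary(k_min=1, k_max=6):
    """List every k-mer over ACGT for k_min <= k <= k_max, shortest first."""
    return [
        "".join(letters)
        for k in range(k_min, k_max + 1)
        for letters in product(ALPHABET, repeat=k)
    ]


def kmer_frequencies(seq, k_min=1, k_max=6):
    """Return one frequency per vocabulary k-mer, in vocabulary order.

    Assumption: a k-mer's frequency is its overlapping-window count divided by
    the number of windows of length k. Windows containing a base outside ACGT
    are counted in the denominator but match no vocabulary entry.
    """
    seq = seq.upper()
    vocab = kmer_vocabulary(k_min, k_max)
    counts = dict.fromkeys(vocab, 0)
    windows = {}
    for k in range(k_min, k_max + 1):
        windows[k] = max(len(seq) - k + 1, 0)
        for i in range(windows[k]):
            kmer = seq[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
    return [
        counts[kmer] / windows[len(kmer)] if windows[len(kmer)] else 0.0
        for kmer in vocab
    ]
```

Stacking one such vector per sequence gives a matrix with 5460 columns, which is the shape a scaler and a PCA step would then take as input.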
“…In this work, we have focused on the development of a general and accurate method based on natural language text processing (NLP) and machine learning models to predict whether a protein sequence will exhibit an antifreeze property or not. We have used K-mer counting to extract different K-mer features from the protein sequences which has earlier been adopted by various studies to tackle many bioinformatics problems. To the best of our knowledge, for the first time, NLP has been proposed to classify AFPs. We also employed the state-of-the-art explainability model, Shapley Additive eXplanations (SHAP), to gain insights into the outcomes produced by the machine learning models.…”
mentioning
confidence: 99%