2020
DOI: 10.1155/2020/2468789

Its2vec: Fungal Species Identification Using Sequence Embedding and Random Forest Classification

Abstract: Fungi play essential roles in many ecological processes, and taxonomic classification is fundamental for microbial community characterization and vital for the study and preservation of fungal biodiversity. To cope with massive fungal barcode data, tools that can implement extensive volumes of barcode sequences, especially the internal transcribed spacer (ITS) region, are necessary. However, high variation in the ITS region and computational requirements for processing high-dimensional features remain challeng…

Cited by 11 publications (7 citation statements)
References 87 publications
“…The application of representation learning to biological sequences is not new. For instance, word2vec, a widely used natural language processing (NLP) technique [48,49], has been applied to obtain embeddings from sequences of the human genome [50], as well as solve problems such as species identification [51,52], methyladenosine site prediction [53,54], and MHC binding site prediction [55,56]. However, word2vec is based on the encoding of local contextual information of sequence segments and have no provision for representing global information of entire sequences.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
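
The "local contextual information of sequence segments" mentioned in the statement above can be illustrated with a small sketch. It assumes that a sequence is split into overlapping k-mers that play the role of words, and that a skip-gram style model is trained on (center, context) pairs drawn from a small window. The function names `kmer_tokens` and `context_pairs`, the value k = 3, and the window size are illustrative assumptions, not details taken from the cited papers.

```python
def kmer_tokens(sequence, k=3):
    """Split a nucleotide sequence into overlapping k-mers ("words").

    k=3 is an arbitrary illustrative choice, not a value from the source.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def context_pairs(tokens, window=1):
    """Return (center, context) pairs within a symmetric window.

    These pairs are the only signal a skip-gram style model sees, which is
    why the resulting embedding captures local rather than global context.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = kmer_tokens("ACGTAC", k=3)
pairs = context_pairs(tokens, window=1)
print(tokens)
print(pairs)
```

Each pair links a k-mer only to its immediate neighbours, so no pair spans the whole sequence. A sequence-level representation has to be built separately, for example by pooling the k-mer vectors.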
“…In this paper, we utilized various ensemble learning classification algorithms to develop identification models, which contain random forest (Ru et al, 2019; Wang et al, 2020b; Ao et al, 2021), AdaBoost, Gradient Boost Decision Tree (Yu et al, 2020b), LightGBM, and XGBoost. In addition, we also tried some traditional machine learning classification algorithms, such as logistic regression and Naïve Bayes.…”
Section: Methods (citation type: mentioning)
confidence: 99%
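“…Five commonly used metrics, ACC, Specificity (SP), Sensitivity (SN), Matthews correlation coefficient (MCC) and AUC [ 51 68 ] were used for model performance evaluation. They are calculated as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{SN} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP},$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(FP + TP)(FN + TP)(FP + TN)(FN + TN)}}$$…”
Section: Methods (citation type: mentioning)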
confidence: 99%
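
The four count-based metrics named in the statement above (ACC, SN, SP, MCC) can be computed directly from a binary confusion matrix. The sketch below uses the standard definitions, including the usual MCC numerator TP×TN − FP×FN. The function name `binary_metrics`, the zero-denominator convention, and the example counts are illustrative assumptions, not values from the cited paper. AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def _safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero.

    Returning 0.0 is a convention chosen for this sketch only.
    """
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, tn, fp, fn):
    """Compute ACC, SN, SP and MCC from binary confusion-matrix counts."""
    acc = _safe_div(tp + tn, tp + tn + fp + fn)
    sn = _safe_div(tp, tp + fn)   # sensitivity (recall on positives)
    sp = _safe_div(tn, tn + fp)   # specificity (recall on negatives)
    mcc_denominator = math.sqrt((fp + tp) * (fn + tp) * (fp + tn) * (fn + tn))
    mcc = _safe_div(tp * tn - fp * fn, mcc_denominator)
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


# Made-up counts, used purely to exercise the formulas.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these example counts, ACC is 0.85, SN is 0.80, and SP is 0.90. MCC comes out near 0.70, lower than ACC because it penalizes the imbalance between false positives and false negatives.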