2020
DOI: 10.3389/fgene.2020.00820

Identification of Orphan Genes in Unbalanced Datasets Based on Ensemble Learning

Abstract: Orphan genes are associated with regulatory patterns, but experimental methods for identifying orphan genes are both time-consuming and expensive. Designing an accurate and robust classification model to detect orphan and non-orphan genes in unbalanced distribution datasets poses a particularly huge challenge. Synthetic minority over-sampling algorithms (SMOTE) are selected in a preliminary step to deal with unbalanced gene datasets. To identify orphan genes in balanced and unbalanced Arabidopsis t…
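The over-sampling step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it shows only the core SMOTE idea of creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote`, its parameters, and the toy feature vectors are hypothetical and chosen for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Illustrative sketch only: `minority` is a list of equal-length numeric
    feature vectors (hypothetical stand-ins for orphan-gene features).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of `base` (excluding itself).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between `base` and `other`.
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy usage: four minority-class samples in a 2-D feature space.
orphan = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(orphan, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the minority class grows without exact duplication. In practice a maintained library implementation would normally be preferred over a hand-written one.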

Citations: cited by 21 publications (22 citation statements)
References: 42 publications
“…A recent study [87] used ensemble learning methods to identify orphan genes. In the study, it was found that XGBoost has better performance than RF and SVM.…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…Taking into consideration the rice genome, in which 37 OGs were obtained under BLAST and BLAT (BLAST-Like Alignment Tool) programs (Jin et al, 2019). Other effective modules or programs include the SMOTE-ENN-XGBoost model (Synthetic Minority Over-sampling TEchnique-Edited Nearest Neighbors-eXtreme Gradient Boosting) (Gao et al, 2020), BIND (BRAKER-Inferred Directly), and MIND (MAKER-Inferred Directly) platforms (Li J. et al, 2021), ORFanFinder (Ekstrom and Yin, 2016), combined BLAST and Microarray-based genome hybridization methods (Li G. et al, 2019).…”
Section: Orphan Genes Identification and Its Fast Evolving Characteri...
Citation type: mentioning
confidence: 99%
“…We matched clinical information, laboratory indicators, disease phenotypes, and cell subpopulation data separately to construct five types of original data sets. Then, the use of the original data sets was compared and the oversampling and the algorithm of Syntic priority oversampling technology (SMOTE) methods (22) which was used to make up for the imbalance in the number of cases included in the data set on various models, including Lasso regression (LR) (23), Random Forest (RF) (22) and XGBoost (24). When applying the oversampling and SMOTE method, the (see Supplementary Table 1).…”
Section: Model Construction
Citation type: mentioning
confidence: 99%