Orphan genes are associated with regulatory patterns, but experimental methods for identifying them are both time-consuming and expensive. Designing an accurate and robust classification model that separates orphan from non-orphan genes in datasets with an unbalanced class distribution is particularly challenging. The synthetic minority over-sampling technique (SMOTE) was selected as a preliminary step to deal with unbalanced gene datasets. To identify orphan genes in balanced and unbalanced Arabidopsis thaliana gene datasets, SMOTE was then combined with traditional and advanced ensemble classification algorithms, namely Support Vector Machine (SVM), Random Forest (RF), AdaBoost (adaptive boosting), GBDT (gradient boosting decision tree), and XGBoost (extreme gradient boosting). In a comparison of these models, SMOTE with XGBoost achieved an F1 score of 0.94 on the balanced A. thaliana gene datasets, but a lower score on the unbalanced datasets. The proposed ensemble method therefore pairs XGBoost separately with each of several data-balancing algorithms: Borderline SMOTE (BSMOTE), Adaptive Synthetic Sampling (ADSYN), SMOTE-Tomek, and SMOTE-ENN. The SMOTE-ENN-XGBoost model, which combines over-sampling and under-sampling with XGBoost, achieved higher predictive accuracy than the other balancing algorithms paired with XGBoost. Thus, SMOTE-ENN-XGBoost provides a theoretical basis for developing evaluation criteria for identifying orphan genes in unbalanced biological datasets.
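The over-sampling idea behind SMOTE can be illustrated with a minimal sketch. This is not the authors' pipeline: it uses only the Python standard library, omits the ENN cleaning step and the XGBoost classifier, and runs on invented two-feature toy data. The function name `smote`, its parameters, and the `orphan` / `non_orphan` lists are assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    minority: list of equal-length feature tuples (the rare class).
    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of `base` among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: 4 "orphan" (minority) vs 10 "non-orphan" (majority)
# feature vectors. Real gene features would be far higher-dimensional.
orphan = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.8, 2.1)]
non_orphan = [(5.0 + 0.1 * i, 6.0 - 0.1 * i) for i in range(10)]

# Add just enough synthetic minority samples to balance the two classes.
new_points = smote(orphan, n_new=len(non_orphan) - len(orphan))
balanced_orphan = orphan + new_points
print(len(balanced_orphan), len(non_orphan))
```

Because every synthetic sample is an interpolation between two real minority samples, the new points stay inside the region already occupied by the minority class. Hybrid schemes such as SMOTE-ENN additionally remove samples after over-sampling, which this sketch does not attempt.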
Insertions and deletions (InDels) are common features of genomes and are associated with genetic variation. Whole-genome re-sequencing data from the two parents (X1 and X2) of the elite cucumber (Cucumis sativus) hybrid variety Lvmei No.1 were used for genome-wide analysis of InDel polymorphisms. The sequence reads obtained were mapped to the reference genome sequence of the Chinese fresh market type inbred line ‘9930’, and gaps conforming to InDels were pinpointed. Further, the level of cross-parents polymorphism among five pairs of cucumber breeding parents and their corresponding hybrid varieties was used to evaluate the efficiency of the InDel markers for hybrid seed purity testing. A panel of 48 cucumber breeding lines was used to assess the PCR amplification versatility of these markers and for phylogenetic analysis. In total, 10,470 candidate InDel markers were identified for X1 and X2. Among these, 385 markers with a difference of more than 30 nucleotides were arbitrarily chosen and tested for experimental resolvability by electrophoresis on an agarose gel. Two hundred and eleven markers (54.81%) were validated as showing a single, clear polymorphic pattern, while 174 (45.19%) showed unclear or monomorphic bands between X1 and X2. Cross-parents polymorphism evaluation identified 68 (32.23%) of the 211 validated markers, which were designated cross-parents transferable (CPT) InDel markers. Interestingly, the marker InDel114 showed experimental transferability between cucumber and melon. PCR amplification of the panel of 48 cucumber breeding lines, including the parents of Lvmei No. 1, with the CPT InDel markers successfully clustered the lines into fruit and common cucumber varieties based on phylogenetic analysis. It is worth noting that 16 of these markers were predominantly associated with enzymatic activities in cucumber.
These agarose-based InDel markers could constitute a valuable resource for hybrid seed purity testing, germplasm classification and marker-assisted breeding in cucumber.
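The percentages in this abstract refer to different denominators, which a short check makes explicit. The counts are taken from the abstract; the variable names and the helper `pct` are chosen here for illustration.

```python
# Counts reported in the abstract.
tested = 385                  # InDels (>30 nt difference) run on agarose gel
polymorphic = 211             # single, clear polymorphic pattern (X1 vs X2)
unclear_or_monomorphic = 174  # unclear or monomorphic bands
cpt = 68                      # cross-parents transferable (CPT) markers


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to 2 decimals."""
    return round(100 * part / whole, 2)


validated_pct = pct(polymorphic, tested)          # share of the 385 tested
failed_pct = pct(unclear_or_monomorphic, tested)  # share of the 385 tested
cpt_pct = pct(cpt, polymorphic)                   # share of the 211 validated

print(validated_pct, failed_pct, cpt_pct)
```

The first two percentages are shares of the 385 markers tested, whereas the CPT percentage matches the reported 32.23% only when computed against the 211 validated markers.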
Tobacco is one of the most important economic crops in China. Long-term disease pressure severely reduces tobacco yield and quality. Currently, research on tobacco disease prevention and control focuses mainly on diagnosing disease that has already occurred and neglects predicting disease before an outbreak. Therefore, in this paper, we follow the idea that prediction should precede disease prevention and control, and we study a model for tobacco disease prevention and control based on a knowledge graph and case-based reasoning (CBR). To implement the model, we choose tobacco mosaic virus (TMV) as the research object and use the following approach to prevent its occurrence. First, a method for predicting environmental factors using principal component analysis (PCA) and a support vector machine (SVM) is proposed. Based on the prediction result, the knowledge graph and CBR are then used to retrieve the most similar case and finally determine the best solution. Experimental results demonstrate that our model achieves high accuracy and gives the most appropriate scheme for disease prevention and control.
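The case-retrieval step of CBR can be sketched as a nearest-case lookup. This is only an illustration under stated assumptions: the abstract does not specify the similarity measure, the environmental factors, or the case base, so the factor names, value ranges, scheme labels, and the distance-based similarity below are all hypothetical. The PCA and SVM prediction step and the knowledge graph are not modelled; `predicted` stands in for their output.

```python
import math

# Hypothetical case base: (environmental factors, recorded control scheme).
CASE_BASE = [
    ({"temperature": 28.0, "humidity": 60.0, "rainfall": 5.0}, "scheme A"),
    ({"temperature": 22.0, "humidity": 85.0, "rainfall": 40.0}, "scheme B"),
    ({"temperature": 33.0, "humidity": 45.0, "rainfall": 0.0}, "scheme C"),
]

# Assumed value ranges, used to normalise each factor to a comparable scale.
RANGES = {
    "temperature": (10.0, 40.0),
    "humidity": (20.0, 100.0),
    "rainfall": (0.0, 100.0),
}


def similarity(a, b):
    """Similarity in (0, 1]: 1 / (1 + Euclidean distance of normalised factors)."""
    total = 0.0
    for name, (low, high) in RANGES.items():
        diff = (a[name] - b[name]) / (high - low)
        total += diff * diff
    return 1.0 / (1.0 + math.sqrt(total))


def retrieve(query):
    """Return (similarity, scheme) for the stored case most similar to `query`."""
    return max(
        (similarity(query, factors), scheme) for factors, scheme in CASE_BASE
    )


# Stand-in for the environmental factors predicted upstream (assumed values).
predicted = {"temperature": 27.0, "humidity": 62.0, "rainfall": 8.0}
best_similarity, best_scheme = retrieve(predicted)
print(best_scheme, round(best_similarity, 3))
```

In a fuller CBR cycle the retrieved scheme would then be adapted and reviewed before being stored back as a new case; only retrieval is shown here.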