2020
DOI: 10.1109/access.2020.2971091
SMOPredT4SE: An Effective Prediction of Bacterial Type IV Secreted Effectors Using SVM Training With SMO

Abstract: Various bacterial pathogens can deliver their secreted effectors to host cells via the type IV secretion system (T4SS) and cause host diseases. Since T4SS secreted effectors (T4SEs) play important roles in the interaction between pathogens and hosts, identifying T4SEs is crucial to understanding the pathogenic mechanism of T4SS. We established an effective predictor called SMOPredT4SE to identify T4SEs from protein sequences. SMOPredT4SE employed combination features of series correlation pseudo amino acid compo…
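The abstract describes a fairly standard sequence-classification pipeline: encode each protein as a fixed-length vector (series-correlation PseAAC combined with PSSM-derived features) and train a support vector machine with the sequential minimal optimization (SMO) algorithm. The sketch below is a minimal, hypothetical illustration of that pipeline, not the authors' code; scikit-learn's SVC stands in for the SMO-trained SVM because its libsvm backend uses an SMO-type solver, and the feature matrices, array shapes, and hyperparameters are placeholders.

```python
# Hedged sketch of an SMO-trained SVM on combined PseAAC + PSSM features.
# Assumptions: `pseaac` and `pssm` are precomputed per-protein feature matrices
# (one row per sequence); `labels` marks T4SEs as 1 and non-T4SEs as 0.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC  # libsvm backend: an SMO-type solver

rng = np.random.default_rng(0)
n_proteins = 915
pseaac = rng.random((n_proteins, 65))   # placeholder for PseAAC vectors
pssm = rng.random((n_proteins, 400))    # placeholder for PSSM-derived features
labels = rng.integers(0, 2, size=n_proteins)

# Combined feature representation: concatenate the two encodings per protein.
X = np.hstack([pseaac, pssm])

# Scale features, then fit an RBF-kernel SVM (illustrative hyperparameters).
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X, labels)
print("training accuracy:", model.score(X, labels))
```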

Citations: cited by 6 publications (5 citation statements)
References: 105 publications (54 reference statements)
“…FY, et al 2019 Prediction, Association Rules BN, SVM, NB, DT Q2 Population BN 92.35%, RMSE 0.26 10-fold cross-validation temperature (min, max, average), minimum humidity and rainfall
10.1109/BigDataCongress.2017.54 [52] The motivation behind this study is to provide a basic framework for biologists, which is based on big data analytics and deep learning models. Huaming Chen et al 2017 DL DL Q2, Q3 Proteomics protein–protein interaction
10.1109/ACCESS.2020.2971091 [48] SMOPredT4SE employed combination features of series correlation pseudo amino acid composition and position-specific scoring matrix to represent protein sequences, and employed support vector machines (SVM) to identify T4SEs. Zihao Yan et al 2020 Prediction, Classification SVM, RF, NB, kNN, Bagging, SGD, LibD3C Q2, Q3 Proteomics 95.60% 5-fold cross-validation composed of 305 T4SEs and 610 non-T4SEs
** Notations: ML: Machine Learning, DM: Data Mining, SVM: Support Vector Machine, ANN: Artificial Neural Network, DT: Decision Tree, RF: Random Forest, GBR: Generalized Boosted Regression, NB: Naïve Bayes, KNN: k-Nearest Neighbors, KM: k-Means, NetA: Network Analysis, RT: Regression Tree, DNN: Deep Neural Networks, PN: Phylogenetic Neighborhood, SVM-RBF-k: SVM-RBF kernel, DL: Deep Learning, BRT: Boosted Regression Tree, BN: Bayes Network, GB: Gradient Boosting, GrB: Generalized Boosted, AdaBoost: Adaptive Boosting, LR: Logistic Regression, HD-LDA: Hierarchical Divisive and Latent Dirichlet Allocation, GBMs: Gradient Boosting Machines, RBF-t: RBF tree, GB-t: gradient boosted tree, SVM-RLK: support vector machine (radial and linear kernel), CTA: Classification Tree Analysis, RRF: Regularized Random Forest, E-SVM: Ensemble of three SVMs, HA: Hierarchical Agglomerative, C: Clustering, GLMM: Generalized Linear Mixed Models, SVM-Lk: SVM-L kernel, Ens: Ensemble, 2-L-SVM-E: two-layer SVM-based ensemble model, CNN: deep Convolutional Neural Network, ERT: Extremely Randomized Trees, MLP: Multilayer Perceptron, XGB: eXtreme Gradient Boosting, MC-SGE: Meta-Classifiers (Stacked Generalized Ensemble).…”
Section: Results
Mentioning, confidence: 99%
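The Yan et al. row quoted above reports 95.60% accuracy under 5-fold cross-validation on a benchmark of 305 T4SEs and 610 non-T4SEs. A small, hypothetical sketch of that evaluation protocol follows; the feature matrix and classifier are stand-ins rather than the published model, and stratified folds are used so each fold preserves the roughly 1:2 class ratio.

```python
# Hedged sketch: stratified 5-fold cross-validation on a 305 vs. 610 class split.
# The feature matrix and classifier are placeholders, not the published model.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.random((305 + 610, 465))          # placeholder feature vectors
y = np.array([1] * 305 + [0] * 610)       # 305 T4SEs, 610 non-T4SEs

accuracies = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, test_idx in skf.split(X, y):
    clf = SVC(kernel="rbf", gamma="scale").fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

print("mean 5-fold accuracy:", np.mean(accuracies))
```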
“…The model accuracy range was 82–97%. The problems addressed included: the identification of high-risk snail habitats as a function of Schistosoma japonicum infection [43], modelling of tick bite risk based on ecological factors [44], predicting the global distribution of Aedes mosquitoes and the effects of seasonal changes on their range [45], [46], and the prediction of Dengue virus outbreak risk based on climate [47], [48].…”
Section: Results
Mentioning, confidence: 99%
“…Where f_u is the frequency of occurrence of the 20 amino acids, τ_k is the k-tier sequence correlation factor, and w is the weighting factor for sequence order effects, with w = 0.05 in our study. The λ components can be defined by the user at will (Yan et al., 2020). In this experiment, hydrophilicity, hydrophobicity, mass, pK1, pK2, pI, rigidity, flexibility, and irreplaceability are added, resulting in a 65-dimensional feature vector.…”
Section: Methods
Mentioning, confidence: 99%
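For context, the quantities named in the quoted passage (f_u, τ_k, w, λ) correspond to the standard Chou-type pseudo amino acid composition; the series-correlation variant used by the cited work extends the correlation factors to several physicochemical properties, but the overall form is the same. A sketch of the standard definition, not necessarily the exact formulation in the cited paper:

```latex
% Standard Chou-type pseudo amino acid composition (sketch only; the cited work
% may use the series-correlation variant over several physicochemical properties).
p_u =
\begin{cases}
  \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & 1 \le u \le 20,\\[2ex]
  \dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & 20 < u \le 20 + \lambda,
\end{cases}
\qquad
\tau_k = \frac{1}{L - k} \sum_{i=1}^{L-k} \Theta\!\left(R_i, R_{i+k}\right)
```

Here L is the sequence length and Θ is a correlation function built from the chosen physicochemical properties; the first 20 components reflect amino acid composition and the remaining λ components carry sequence-order information, weighted by w.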
“…Instead, a large number of computational methods have been developed for the prediction of T4SEs in the last decade, which successfully speed up the process in terms of time and efficiency. These computational approaches can be categorized into two main groups: the first group infers new effectors based on sequence similarity with currently known effectors (Chen et al., 2010; Lockwood et al., 2011; Marchesini et al., 2011; Meyer et al., 2013; Sankarasubramanian et al., 2016; Noroy et al., 2019) or phylogenetic profiling analysis (Zalguizuri et al., 2019), and the second group learns the patterns that distinguish known secreted effectors from non-secreted proteins using machine learning and deep learning techniques (Burstein et al., 2009; Lifshitz et al., 2013; Zou et al., 2013; Wang et al., 2014; Ashari et al., 2017; Wang Y. et al., 2017; Esna Ashari et al., 2018, 2019a, b; Guo et al., 2018; Xiong et al., 2018; Xue et al., 2018; Acici et al., 2019; Chao et al., 2019; Hong et al., 2019; Wang J. et al., 2019; Li J. et al., 2020; Yan et al., 2020). Among the latter, Burstein et al. (2009) worked on Legionella pneumophila to identify T4SEs and validated 40 novel effectors that were predicted by machine learning algorithms.…”
Section: Introduction
Mentioning, confidence: 99%