T4SE-XGB: interpretable sequence-based prediction of type IV secreted effectors using eXtreme gradient boosting algorithm

Chen, Tianhang; Wang, Xiangeng; Chu, Yanyi; Wei, Dong‐Qing; Xiong, Yi

doi:10.1101/2020.06.18.158253

Cited by 8 publications

(6 citation statements)

References 93 publications


“…In traditional gradient boosting, each new tree specifically focuses on the error of the previous tree. XGBoost adds more regularization terms in the model to control model over-fitting, which makes the model have a better performance ( Chen and Guestrin, 2016 ; Chen et al, 2020 ). In this study, “XGBClassifier” from “xgboost” library 7 was used for prediction.…”

Section: Methods (mentioning)

confidence: 99%
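
The citation statement above describes boosting as each new tree fitting the error left by the previous one, with XGBoost adding regularization to limit over-fitting. The following is a minimal standard-library sketch of that idea for squared-error regression with one-split trees. It is not the xgboost implementation: the parameter names `reg_lambda` and `learning_rate` are borrowed only for illustration, the split rule is plain squared error rather than XGBoost's gain formula, and the data are made up.

```python
# Toy gradient boosting with decision stumps (illustrative, not xgboost itself).

def fit_stump(xs, residuals, reg_lambda):
    """Pick the single split that best fits the current residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        # L2-regularised leaf value: sum of residuals / (count + lambda).
        # A larger reg_lambda shrinks each leaf towards zero.
        lv = sum(left) / (len(left) + reg_lambda)
        rv = sum(right) / (len(right) + reg_lambda) if right else 0.0
        loss = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or loss < best[0]:
            best = (loss, threshold, lv, rv)
    return best[1:]


def boost(xs, ys, n_rounds=20, learning_rate=0.5, reg_lambda=1.0):
    """Fit stumps one after another, each on the residuals of the ensemble so far."""
    stumps, history = [], []
    preds = [0.0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals, reg_lambda)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
        # Track the training mean squared error after each round.
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return stumps, history


def predict(stumps, x, learning_rate=0.5):
    """Sum the shrunken contributions of all stumps for one input."""
    return sum(learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps)


# Made-up one-dimensional data with a step between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
stumps, history = boost(xs, ys)
print(history[0], history[-1])
```

In this sketch the training error recorded in `history` cannot increase from one round to the next, because every leaf update moves the predictions only part of the way towards the mean residual in that leaf.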

Tracking Major Sources of Water Contamination Using Machine Learning

Song, Dubinsky, et al. 2021. Front. Microbiol.

Current microbial source tracking techniques that rely on grab samples analyzed by individual endpoint assays are inadequate to explain microbial sources across space and time. Modeling and predicting host sources of microbial contamination could add a useful tool for watershed management. In this study, we tested and evaluated machine learning models to predict the major sources of microbial contamination in a watershed. We examined the relationship between microbial sources, land cover, weather, and hydrologic variables in a watershed in Northern California, United States. Six models, including K-nearest neighbors (KNN), Naïve Bayes, Support vector machine (SVM), simple neural network (NN), Random Forest, and XGBoost, were built to predict major microbial sources using land cover, weather and hydrologic variables. The results showed that these models successfully predicted microbial sources classified into two categories (human and non-human), with the average accuracy ranging from 69% (Naïve Bayes) to 88% (XGBoost). The area under curve (AUC) of the receiver operating characteristic (ROC) illustrated XGBoost had the best performance (average AUC = 0.88), followed by Random Forest (average AUC = 0.84), and KNN (average AUC = 0.74). The importance index obtained from Random Forest indicated that precipitation and temperature were the two most important factors to predict the dominant microbial source. These results suggest that machine learning models, particularly XGBoost, can predict the dominant sources of microbial contamination based on the relationship of microbial contaminants with daily weather and land cover, providing a powerful tool to understand microbial sources in water.
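
The abstract above ranks the models by area under the ROC curve. As a rough illustration of what that number measures, the sketch below computes AUC as the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_auc`, the labels, and the scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Compare every positive score with every negative score; ties count 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = human source, 0 = non-human source.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.3, 0.6, 0.4, 0.5, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

Here eight of the nine positive/negative pairs are ordered correctly, so the sketch yields an AUC of 8/9.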

“…CNN-T4SE integrated three Convolutional Neural Network models training the amino acid composition, solvent accessibility and secondary structure of full-length T4SEs, achieving better performance than other tools and lower false positive predictions [298] . Other groups adopted an alternative strategy, by selecting the best optimized features, and/or training and identifying the best machine learning models, to improve the prediction performance [299] , [149] , [150] , [151] . Some of the models have been well applied in identification of T4SEs in L. pneumophila [151] and Anaplasma phagocytophilum (OPT4e; [150] ).…”

Section: Outer Membrane and Two-membrane Spanning Secretion Systems (mentioning)

confidence: 99%

Computational prediction of secreted proteins in gram-negative bacteria

Hui, Chen, Zhang, et al. 2021. Computational and Structural Biotechnology Journal

“…Importance measure of every aligned position to the predictive performance on test sets of each model was calculated by the SHAP package, 41 which was frequently adopted to understand sequence-property relationship in proteins. [42][43][44] For every one-hot feature of an FP sequence that undergoes prediction, SHAP assigns an importance measure to the feature called the SHAP value. A positive SHAP value corresponds to a positive contribution of the feature value to the predicted target, while a higher SHAP value corresponds to a higher importance of the feature value to the prediction of the target.…”

Section: Feature Importance Calculations (mentioning)

confidence: 99%
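
To make the sign and size of a SHAP value concrete, the sketch below computes exact Shapley values for a three-feature toy function by enumerating every feature subset. This is an assumption-laden illustration, not the SHAP package: features that are "absent" are simply replaced by a baseline value, and `toy_model`, the input, and the baseline are all invented for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one input; absent features take baseline values."""
    n = len(x)

    def value(subset):
        # Evaluate the model with only the features in `subset` switched on.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_model(features):
    """Hypothetical model: two linear terms plus one interaction."""
    a, b, c = features
    return 2.0 * a - 1.0 * b + 0.5 * a * c


x = [1, 1, 1]          # e.g. three one-hot features that are all present
baseline = [0, 0, 0]   # reference input with all three absent
phis = shapley_values(toy_model, x, baseline)
print(phis)
```

The first feature receives a positive value and the second a negative one, matching the signs of their effects on the output, and the three values add up to the difference between the prediction for `x` and the prediction for `baseline`.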

FPredX: Interpretable models for the prediction of spectral maxima, brightness, and oligomeric states of fluorescent proteins

Tam, Zhang, 2021. Proteins

Fluorescent protein (FP) design is among the challenging protein design problems due to the tradeoffs among multiple properties to be optimized. Despite the accumulated efforts in design and characterization, progress has been slow in gaining a full understanding of sequence-property relationships to tackle the multiobjective design problem in FPs. In this study, we approach this problem by developing FPredX, a collection of gradient-boosted decision tree models, which mapped FP sequences to four major design targets of FPs, including excitation maximum, emission maximum, brightness, and oligomeric state. By training using one-hot encoded multiple aligned sequences with hyperparameters optimization in each model, FPredX models showed excellent prediction performance for all target properties compared with existing methods. We further interpreted the FPredX models by comparing the importance of positions along the aligned FP sequence to the predictive performance and suggested positions, which showed differential importance deemed by FPredX models to the prediction of each target property.
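
One-hot encoding of aligned sequences, as mentioned in the abstract above, could look roughly like the sketch below. The alphabet (20 amino acids plus a gap symbol), the function name `one_hot_encode`, and the tiny alignment are assumptions made for illustration and are not taken from FPredX.

```python
# 20 standard amino acids plus "-" for alignment gaps (assumed alphabet).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot_encode(aligned_seq, alphabet=ALPHABET):
    """Flatten an aligned sequence into one 0/1 feature per (position, symbol)."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    vector = [0] * (len(aligned_seq) * len(alphabet))
    for pos, ch in enumerate(aligned_seq):
        # Raises KeyError for symbols outside the assumed alphabet.
        vector[pos * len(alphabet) + index[ch]] = 1
    return vector


# Made-up alignment: every row must have the same length.
alignment = ["MK-V", "MKAV", "MR-V"]
if len({len(seq) for seq in alignment}) != 1:
    raise ValueError("aligned sequences must share one length")

features = [one_hot_encode(seq) for seq in alignment]
print(len(features), len(features[0]))
```

Each aligned position contributes one block of 21 columns, so a feature index maps back to a position and a residue, which is what allows per-position importance to be read off a trained model.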
