Detecting thermophilic proteins through selecting amino acid and dipeptide composition features

Nakariyakul, Songyot; Liu, Zhiping; Chen, Luonan

doi:10.1007/s00726-011-0923-1

Cited by 31 publications

(29 citation statements)

References 33 publications

“…Lin et al constructed a dataset containing 915 thermophilic proteins and 793 non-thermophilic proteins, and predicted 93.8% thermophilic proteins and 92.7% nonthermophilic proteins using SVM. The same conclusion was also reached by Nakariyakul et al (2012), who obtained 93.3% identification accuracy in the same database used by Lin. In another study, Fan et al (2016) integrated information on the amino acid composition, evolution information, and acid dissociation constant to identify thermophiles by SVM, yielding an overall accuracy of 93.53%. Modarres et al (2018) proposed a new thermophilic protein database, which contained 14 million protein sequences.…”

Section: Introduction (supporting)

confidence: 77%

A Method for Prediction of Thermophilic Protein Based on Reduced Amino Acids and Mixed Features

Feng, Dan, et al. 2020, Front. Bioeng. Biotechnol.

The thermostability of proteins is a key factor considered during enzyme engineering, and a method that can distinguish thermophilic from non-thermophilic proteins will be helpful for enzyme design. In this study, we established a novel method combining mixed features and machine learning to achieve this recognition task. In this method, an amino acid reduction scheme was adopted to recode the amino acid sequence. Then, the physicochemical characteristics, auto-cross covariance (ACC), and reduced dipeptides were calculated and integrated to form a mixed feature set, which was processed using correlation analysis, feature selection, and principal component analysis (PCA) to remove redundant information. Finally, four machine learning methods were trained and evaluated on a dataset of 500 proteins randomly sampled from 915 thermophilic proteins and 500 randomly sampled from 793 non-thermophilic proteins. The experimental results showed that 98.2% of thermophilic and non-thermophilic proteins were correctly identified using 10-fold cross-validation. Moreover, our analysis of the retained and removed features identified the crucial, unimportant, and insensitive elements and provided essential information for enzyme design.
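The recoding and composition steps described in this abstract can be illustrated with a minimal Python sketch. The five-group reduction scheme and the names `REDUCTION`, `reduce_sequence`, and `composition` are assumptions made for illustration; the abstract does not state which reduction scheme the study used, and the physicochemical, ACC, and PCA steps are not shown.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical 5-group reduction scheme, for illustration only;
# the scheme used in the cited study is not given in the abstract.
REDUCTION = {
    **dict.fromkeys("AVLIMC", "a"),  # aliphatic / sulfur-containing
    **dict.fromkeys("FWYH", "r"),    # aromatic
    **dict.fromkeys("STNQ", "p"),    # polar, uncharged
    **dict.fromkeys("KRDE", "c"),    # charged
    **dict.fromkeys("GP", "s"),      # special conformations
}


def reduce_sequence(seq, scheme=REDUCTION):
    """Recode a protein sequence into the reduced alphabet, skipping unknown residues."""
    return "".join(scheme[aa] for aa in seq.upper() if aa in scheme)


def composition(seq, alphabet, k=1):
    """Frequency of every k-mer over `alphabet` (k=1: residues, k=2: dipeptides)."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)  # ignore k-mers outside the alphabet
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


seq = "MKVEELRKKG"                          # toy sequence, not real data
aac = composition(seq, AMINO_ACIDS, k=1)    # 20 amino acid composition features
reduced = reduce_sequence(seq)              # sequence in the reduced alphabet
dpc = composition(reduced, "arpcs", k=2)    # 25 reduced-dipeptide features
```

With a five-letter alphabet the dipeptide vector has 25 entries instead of 400, which is the kind of dimensionality saving a reduction scheme offers before feature selection.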

“…As shown in Table 7, the ranks of the top-five amino acids to be TPPs (propensity, difference) for Glu, Lys, Val, Arg and Ile are (1, 1), (2, 2), (3, 3), (4, 4) and (5, 5), respectively, while the ranks of the top-five amino acids to be non-TPPs for Gln, Thr, Ala, Asn and Phe are (20, 20), (19, 18), (18, 19), (17, 16) and (16, 13), respectively. Many previous studies indicated that Glu, Lys and Arg had higher occurrence in TPPs than MPPs [20, 27, 28, 35, 52–55]. For example, Haney et al. [53] conducted a comprehensive analysis on 115 protein sequences from M. jannaschii.…”

Section: Results (mentioning)

confidence: 92%

“…Several computational efforts based on machine learning (ML) methods have been made in recent years to identify TPPs [20, 21, 24–33] as summarized in Table 1. As can be seen from Table 1, support vector machine (SVM) method is the most widely used technique for identifying TPPs [20, 21, 24–26, 28–30]. For instance, Zhang and Fan [31] developed the first TPP predictor based on amino acid composition (AAC) descriptors.…”

Section: Introduction (mentioning)

confidence: 99%

A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides

Charoenkwan, Chotpatiwetchkul, Lee, et al. 2021, Sci Rep

Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and a variety of applications in the food industry. As a result, the development of computational models for rapidly and accurately identifying novel TPPs from a large number of uncharacterized protein sequences is desirable. Although computational models have already been developed for characterizing thermophilic proteins, their performance and interpretability remain unsatisfactory. We present a novel sequence-based thermophilic protein predictor, termed SCMTPP, for improving model predictability and interpretability. First, an up-to-date and high-quality dataset consisting of 1853 TPPs and 3233 non-TPPs was compiled from published literature. Second, the SCMTPP predictor was created by combining the scoring card method (SCM) with estimated propensity scores of g-gap dipeptides. Benchmarking experiments revealed that SCMTPP had a cross-validation accuracy of 0.883, which was comparable to that of a support vector machine-based predictor (0.906–0.910) and 2–17% higher than that of commonly used machine learning models. Furthermore, SCMTPP outperformed the state-of-the-art approach (ThermoPred) on the independent test dataset, with accuracy and MCC of 0.865 and 0.731, respectively. Finally, the SCMTPP-derived propensity scores were used to elucidate the critical physicochemical properties for protein thermostability enhancement. In terms of interpretability and generalizability, comparative results showed that SCMTPP was effective for identifying and characterizing TPPs. We have implemented the proposed predictor as a user-friendly online web server at http://pmlabstack.pythonanywhere.com/SCMTPP in order to allow easy access to the model. SCMTPP is expected to be a powerful tool for facilitating community-wide efforts to identify TPPs on a large scale and guiding experimental characterization of TPPs.
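The idea of scoring a sequence with dipeptide propensity scores can be sketched as follows. This is a rough illustration under stated assumptions, not the SCMTPP implementation: the initial scores are assumed here to be the difference in mean g-gap dipeptide composition between the two classes, rescaled to 0..1000, and any subsequent optimization of the scores is omitted. The function names and the toy sequences are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def gap_dipeptide_composition(seq, g=0):
    """Frequency of residue pairs separated by g positions (g=0: adjacent pairs)."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    return {d: (c / total if total else 0.0) for d, c in counts.items()}


def initial_propensity(pos_seqs, neg_seqs, g=0):
    """Assumed estimate: mean composition in positives minus negatives, rescaled to 0..1000."""
    def mean_comp(seqs):
        comps = [gap_dipeptide_composition(s, g) for s in seqs]
        return {d: sum(c[d] for c in comps) / len(comps) for d in DIPEPTIDES}

    pos, neg = mean_comp(pos_seqs), mean_comp(neg_seqs)
    diff = {d: pos[d] - neg[d] for d in DIPEPTIDES}
    lo, hi = min(diff.values()), max(diff.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all differences are equal
    return {d: 1000.0 * (diff[d] - lo) / span for d in DIPEPTIDES}


def scm_score(seq, propensity, g=0):
    """Sequence score: composition-weighted sum of the dipeptide propensity scores."""
    comp = gap_dipeptide_composition(seq, g)
    return sum(comp[d] * propensity[d] for d in DIPEPTIDES)


# Toy sequences for illustration only, not real proteins.
pos = ["EKEKEKRVEK", "KEKEIVEKEK"]
neg = ["QTQTANQTQT", "TQATQNTQTQ"]
scores = initial_propensity(pos, neg)
pos_like = scm_score("EKEKVEKE", scores)
neg_like = scm_score("QTQTAQTQ", scores)
```

A sequence would then be called thermophilic when its score exceeds a threshold chosen on training data. Because each dipeptide has a single readable score, the table of scores itself can be inspected, which is the interpretability argument the abstract makes.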

“…They are relatively fast and unbiased in favor of a specific classifier. On the other hand, wrapper methods [10,11] use the performance of a classifier as the criterion function to assess the quality of a selected subset. The wrapper method generally achieves better classification performance than the filter method for the same number of selected genes, but it is also more time-consuming.…”

Section: Introduction (mentioning)

confidence: 99%

A hybrid gene selection algorithm based on interaction information for microarray-based cancer classification

Nakariyakul 2019, PLoS ONE (self-citation)

We address gene selection and machine learning methods for cancer classification using microarray gene expression data. Due to the high dimensionality of microarray data, traditional gene selection algorithms are filter-based, focusing on intrinsic properties of the data such as distance, dependency, and correlation. These methods are fast but select far too many genes to use for the classification task. In this work, we present a new hybrid filter-wrapper gene subset selection algorithm that is an improved modification of our prior algorithm. Our proposed method employs interaction information to rank candidate genes to add into a gene subset. It then conditionally adds one gene at a time into the current subset and verifies whether the resultant subset improves the classification performance significantly. Only significant genes are selected, and the candidate gene list is updated every time a gene is added to the subset. Thus, our gene selection algorithm is very dynamic. Experimental results on ten public cancer microarray data sets show that our method consistently outperforms prior gene selection algorithms in terms of classification accuracy, while requiring a small number of selected genes.
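The filter-then-wrapper loop described in this abstract can be sketched in a few lines of Python. This is a simplified illustration, not the authors' algorithm: the filter ranking uses the absolute difference of class means as a stand-in for interaction information, the ranking is computed once rather than updated after each addition, and "significant improvement" is approximated by a fixed accuracy gain `min_gain`. All function names and the toy data are hypothetical.

```python
def centroid_loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on the chosen features."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
        point = [X[i][f] for f in features]

        def sq_dist(label):
            return sum((p - q) ** 2 for p, q in zip(point, centroids[label]))

        correct += min(centroids, key=sq_dist) == y[i]
    return correct / len(X)


def filter_score(X, y, f):
    """Stand-in filter criterion: absolute difference of the two class means."""
    labels = sorted(set(y))
    means = [sum(X[j][f] for j in range(len(X)) if y[j] == lab) / y.count(lab)
             for lab in labels]
    return abs(means[0] - means[1])


def hybrid_select(X, y, min_gain=0.05):
    """Rank features with the filter, then keep one only if the wrapper accuracy improves."""
    ranked = sorted(range(len(X[0])), key=lambda f: filter_score(X, y, f), reverse=True)
    selected, best = [], 0.0
    for f in ranked:
        acc = centroid_loo_accuracy(X, y, selected + [f])
        if acc - best >= min_gain:  # conditional addition: keep only useful features
            selected.append(f)
            best = acc
    return selected, best


# Toy data: feature 0 separates the classes; features 1 and 2 are mostly noise.
X = [[0.1, 5.0, 1.0], [0.2, 3.0, 1.2], [0.0, 4.0, 0.9],
     [0.9, 4.5, 1.1], [1.0, 3.5, 1.0], [1.1, 4.2, 1.3]]
y = [0, 0, 0, 1, 1, 1]
selected, best = hybrid_select(X, y)
```

The filter step keeps the search cheap by fixing the order in which candidates are tried, while the wrapper step ties the final subset to classifier performance, which is the trade-off between filter and wrapper methods noted in the citation statement above.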
