iDNA-Prot: Identification of DNA Binding Proteins Using Random Forest with Grey Model

Wei, Lin; Fang, Jian‐an; Xiao, Xuan; Chou, Kuo‐Chen

doi:10.1371/journal.pone.0024756

Cited by 264 publications

(212 citation statements)

References 54 publications

Supporting

Mentioning

205

Contrasting

“…It is sequence-based method, in which the generated feature vector for protein sequence is based on the distance between residue pairs and has shown better performance for protein remote homology detection. "Distance Pair" method incorporates the amino acid distance pair coupling information and the amino acid reduced alphabet profile into the general pseudo amino acid composition (PseAAC) [108] vector, which is very useful for analysing DNA-binding proteins [15,170,189,275]. PDT is the abbreviation for "physicochemical distance transformation", which can incorporate considerable sequence-order information or important patterns of protein/peptide sequences into Pseudo components [28], which is very useful for conducting various proteome analyses [17, 23, 215-217, 224, 225, 231, 235, 276-289] and genome analysis as well [216,218,220,223,229,255,277,290].…”

Section: Category Mode (mentioning)

confidence: 99%

“…This is because almost all the existing machine-learning algorithms, such as "Neural Network" or NN algorithm [1-3], "Support Vector Machine" or SVM algorithm [4-12], "Nearest Neighbor" or NN algorithm [13,14], and "Random Forest" algorithm [15-22], can only handle vectors but not sequence samples as elucidated in a review paper [23]. Unfortunately, if using the sequential model, i.e., the model in which all the samples are represented by their original sequences, it is hardly able to train a machine learning model that can cover all the possible cases concerned, as elaborated in [24].…”

Section: Introduction (mentioning)

confidence: 99%

Pse-in-One 2.0: An Improved Package of Web Servers for Generating Various Modes of Pseudo Components of DNA, RNA, and Protein Sequences

Liu, Wu, Chou 2017

Self Cite

132

Pse-in-One 2.0 is a package of web servers evolved from Pse-in-One (Liu, B., Liu, F., Wang, X., Chen, J., Fang, L. & Chou, K.C. Nucleic Acids Research, 2015, 43:W65-W71). To make it more flexible and comprehensive, as suggested by many users, the updated package incorporates 23 new pseudo component modes as well as a series of new feature analysis approaches. It is available at http://bioinformatics.hitsz.edu.cn/Pse-in-One2.0/. Moreover, for the convenience of users, a stand-alone version called "Pse-in-One-Analysis" is also provided, with which users can significantly speed up the analysis of massive sequences.

“…In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, subsampling test, and jackknife test (Chou and Zhang, 1995). However, as elucidated in Chou and Shen (2008) and demonstrated by Eqs. 28-32 of Chou (2011), among the three cross-validation methods, the jackknife test is deemed the least arbitrary (most objective) that can always yield a unique result for a given benchmark dataset, and hence has been increasingly used and widely recognized by investigators to examine the accuracy of various predictors (Georgiou et al, 2009; Zeng et al, 2009; Esmaeili et al, 2010; Mohabatkar, 2010; Qiu et al, 2010; Hu et al, 2011a, 2011b; Huang et al, 2011a, 2011b; Lin et al, 2011; Wang et al, 2011; Xiao et al, 2011). Accordingly, the jackknife test, also known as Leave-One-Out Cross-Validation (LOOCV) (Huang et al, 2008; Cai et al, 2010; Huang et al, 2009, 2010a, 2010b) was adopted here to examine the quality of the present predictor.…”

Section: Predictor Construction and Evaluation (mentioning)

confidence: 99%

SySAP: a system-level predictor of deleterious single amino acid polymorphisms

et al. 2011

Single amino acid polymorphisms (SAPs), also known as non-synonymous single nucleotide polymorphisms (nsSNPs), are responsible for most human genetic diseases. Discriminating deleterious SAPs from neutral ones can help identify disease genes and understand the mechanisms of disease. In this work, a method of deleterious SAP prediction at the system level was established. Unlike most existing methods, our method considers not only sequence and structure information but also network information. The integration of network information can improve the performance of deleterious SAP prediction. To make our method available to the public, we developed SySAP (a System-level predictor of deleterious Single Amino acid Polymorphisms), an easy-to-use and highly accurate web server. SySAP is freely available at http://www.biosino.org/SySAP/ and http://lifecenter.sgst.cn/SySAP/.
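The jackknife test (leave-one-out cross-validation) named in the citation statement for this entry can be sketched in a few lines. This is a minimal illustration, not the procedure of any cited paper: the 1-nearest-neighbor classifier, the function names, and the toy data are all assumptions chosen to keep the example self-contained.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_neighbor_predict(train, query):
    """Return the label of the training sample closest to `query` (1-NN).

    Hypothetical stand-in for whatever predictor is being evaluated.
    """
    _, label = min(train, key=lambda sample: dist(sample[0], query))
    return label


def jackknife_accuracy(samples, predict=nearest_neighbor_predict):
    """Leave-one-out: hold out each sample once and predict it from the rest."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    correct = 0
    for i, (features, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]  # every sample except the i-th
        if predict(training, features) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical toy data: (feature vector, label) pairs, not from any benchmark.
toy = [
    ((0.0, 0.1), "binding"),
    ((0.2, 0.0), "binding"),
    ((0.1, 0.2), "binding"),
    ((1.0, 0.9), "non-binding"),
    ((0.9, 1.1), "non-binding"),
    ((1.1, 1.0), "non-binding"),
]
print(f"jackknife accuracy: {jackknife_accuracy(toy):.2f}")
```

Because every sample is held out exactly once and no random split is involved, repeated runs on the same dataset give the same score, which is the "unique result" property the citation statement refers to.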

“…Amino acid composition of proteins associated with the biochemical properties are the commonly used sequence-based features, for example Cai and Lin [1] used protein's amino acid composition, limited range correlation of hydrophobicity and solvent accessible surface area to identify DBPs; Ahmad et al [2] found the specificity of sequence level and binding level and analyzed the relationship between them; Fang et al [3] encoded the feature space by autocross-covariance (ACC) transform, pseudoamino acid composition, dipeptide composition; Zou et al [4] adopted three different feature transformation methods to generate numeric feature vectors from protein sequences; Lin et al [5] represented each sequence as pseudo amino acid composition by applied grey model. For more accurately predictive performance, the combinations of different features were employed, for example Kumar et al [6] derived sequence properties by frequency of amino acid, amino acid groups, secondary structure, com…”

Section: Introduction (mentioning)

confidence: 99%

PseDNA‐Pro: DNA‐Binding Protein Identification by Combining Chou’s PseAAC and Physicochemical Distance Transformation

Liu, Fan, et al. 2014

Molecular Informatics

156

113

Identification of DNA-binding proteins is an important problem in biomedical research, as DNA-binding proteins are crucial for various cellular processes. Currently, machine learning methods achieve state-of-the-art performance with different features. A key step in improving the performance of these methods is to find a suitable representation of proteins. In this study, we proposed a feature vector composed of three kinds of sequence-based features: overall amino acid composition, pseudo amino acid composition (PseAAC) proposed by Chou, and physicochemical distance transformation. These features not only consider the sequence composition of proteins, but also incorporate the sequence-order information of amino acids in proteins. The feature vectors were fed into a Support Vector Machine (SVM) for DNA-binding protein identification. The proposed method is called PseDNA-Pro. Experiments on stringent benchmark datasets and independent test datasets using the jackknife test showed that PseDNA-Pro can achieve an accuracy higher than 80%, outperforming several state-of-the-art methods, including DNAbinder, DNA-Prot, and iDNA-Prot. These results indicate that combining various features is a suitable approach for DNA-binding protein prediction, and that the sequence-order information among residues in proteins is relevant for discrimination. For practical applications, a web server for PseDNA-Pro was established, which is available at http://bioinformatics.hitsz.edu.cn/PseDNA-Pro/.
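Of the three feature types named in this abstract, overall amino acid composition is the simplest to illustrate: it maps a variable-length sequence to a fixed 20-dimensional vector of residue frequencies. The sketch below is not PseDNA-Pro's implementation. The function name, the residue ordering, and the example sequence are assumptions, and the PseAAC and physicochemical distance transformation components are omitted.

```python
# The 20 standard amino acids in alphabetical one-letter order (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20 residue frequencies of `sequence`, ordered as AMINO_ACIDS.

    Characters outside the 20 standard residues (e.g. 'X') are ignored,
    so the frequencies always sum to 1.
    """
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [count / total for count in counts]


# Hypothetical peptide, used only to show the output shape.
vector = amino_acid_composition("MKVLAAGIVK")
print(len(vector))
```

Every sequence, whatever its length, yields a vector of the same size, which is what lets vector-based learners such as an SVM or a random forest take protein sequences as input. The cost is that composition alone discards residue order, which the other two feature types in the abstract are meant to recover.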
