2011
DOI: 10.1371/journal.pone.0024756

iDNA-Prot: Identification of DNA Binding Proteins Using Random Forest with Grey Model

Abstract: DNA-binding proteins play crucial roles in various cellular processes. Developing high throughput tools for rapidly and effectively identifying DNA-binding proteins is one of the major challenges in the field of genome annotation. Although many efforts have been made in this regard, further effort is needed to enhance the prediction power. By incorporating the features into the general form of pseudo amino acid composition that were extracted from protein sequences via the “grey model” and by adopting the rando…

Cited by 264 publications (212 citation statements)
References 54 publications
“…It is sequence-based method, in which the generated feature vector for protein sequence is based on the distance between residue pairs and has shown better performance for protein remote homology detection. "Distance Pair" method incorporates the amino acid distance pair coupling information and the amino acid reduced alphabet profile into the general pseudo amino acid composition (PseAAC) [108] vector, which is very useful for analysing DNA-binding proteins [15,170,189,275]. PDT is the abbreviation for "physicochemical distance transformation", which can incorporate considerable sequence-order information or important patterns of protein/peptide sequences into Pseudo components [28], which is very useful for conducting various proteome analyses [17, 23, 215-217, 224, 225, 231, 235, 276-289] and genome analysis as well [216,218,220,223,229,255,277,290].…”
Section: Category Mode (mentioning)
confidence: 99%
“…This is because almost all the existing machine-learning algorithms, such as "Neural Network" or NN algorithm [1-3], "Support Vector Machine" or SVM algorithm [4-12], "Nearest Neighbor" or NN algorithm [13,14], and "Random Forest" algorithm [15-22], can only handle vectors but not sequence samples, as elucidated in a review paper [23]. Unfortunately, if using the sequential model, i.e., the model in which all the samples are represented by their original sequences, it is hardly able to train a machine learning model that can cover all the possible cases concerned, as elaborated in [24].…”
Section: Introduction (mentioning)
confidence: 99%
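
The point made in the statement above, that such learners take fixed-length vectors rather than raw sequences, can be illustrated with the simplest encoding, plain amino acid composition. This is only a minimal sketch: the function name `amino_acid_composition` and the example sequences are hypothetical, and the encoding is far simpler than the grey-model pseudo amino acid composition used by iDNA-Prot.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (fixed order defines the vector layout).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Map a protein sequence of any length to a 20-dimensional frequency vector."""
    seq = sequence.upper()
    # Count only standard residues; ambiguous codes such as X are ignored.
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Two made-up sequences of different lengths yield vectors of the same length.
short_vec = amino_acid_composition("MKV")
long_vec = amino_acid_composition("MKVLAAGIVKK")
print(len(short_vec), len(long_vec))
```

Plain composition discards all sequence-order information, which is the gap that pseudo amino acid composition is meant to narrow.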
“…In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, subsampling test, and jackknife test (Chou and Zhang, 1995). However, as elucidated in Chou and Shen (2008) and demonstrated by Eqs. 28-32 of Chou (2011), among the three cross-validation methods, the jackknife test is deemed the least arbitrary (most objective) that can always yield a unique result for a given benchmark dataset, and hence has been increasingly used and widely recognized by investigators to examine the accuracy of various predictors (Georgiou et al, 2009; Zeng et al, 2009; Esmaeili et al, 2010; Mohabatkar, 2010; Qiu et al, 2010; Hu et al, 2011a, 2011b; Huang et al, 2011a, 2011b; Lin et al, 2011; Wang et al, 2011; Xiao et al, 2011). Accordingly, the jackknife test, also known as Leave-One-Out Cross-Validation (LOOCV) (Huang et al, 2008; Cai et al, 2010; Huang et al, 2009, 2010a, 2010b) was adopted here to examine the quality of the present predictor.…”
Section: Predictor Construction and Evaluation (mentioning)
confidence: 99%
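
The jackknife (leave-one-out) procedure described in the statement above can be sketched in a few lines. This is an illustrative sketch only: the 1-nearest-neighbor classifier is a stand-in chosen for brevity, not the predictor of the quoted paper, and the function names and toy data are hypothetical.

```python
import math


def nearest_neighbor_predict(train_x, train_y, query):
    """Return the label of the training vector closest to query (Euclidean distance)."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]


def jackknife_accuracy(features, labels):
    """Hold out each sample once, predict it from all the others, and average."""
    correct = 0
    for i in range(len(features)):
        # Training set is every sample except the i-th one.
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_neighbor_predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Toy, made-up feature vectors forming two well-separated classes.
features = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 0.9], [0.9, 1.0], [1.1, 1.0]]
labels = [0, 0, 0, 1, 1, 1]
print(jackknife_accuracy(features, labels))
```

Because every sample is held out exactly once and no random split is involved, repeated runs on the same benchmark give the same number, which is the uniqueness property the statement refers to.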
“…Amino acid composition of proteins associated with the biochemical properties are the commonly used sequence-based features, for example Cai and Lin [1] used protein's amino acid composition, limited range correlation of hydrophobicity and solvent accessible surface area to identify DBPs; Ahmad et al [2] found the specificity of sequence level and binding level and analyzed the relationship between them; Fang et al [3] encoded the feature space by auto-cross covariance (ACC) transform, pseudo amino acid composition, dipeptide composition; Zou et al [4] adopted three different feature transformation methods to generate numeric feature vectors from protein sequences; Lin et al [5] represented each sequence as pseudo amino acid composition by applied grey model. For more accurately predictive performance, the combinations of different features were employed, for example Kumar et al [6] derived sequence properties by frequency of amino acid, amino acid groups, secondary structure, com…”
Abstract: Identification of DNA-binding proteins is an important problem in biomedical research as DNA-binding proteins are crucial for various cellular processes.…
Section: Introduction (mentioning)
confidence: 99%